Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other media documents?
Viewed 7,537 times
4 Answers
7
Check the documentation - yes you can.
Example
java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika
will show you the text on this page.

fvu
- And what if I need to use this from Java code and save the text from the URL to a text file? Is that also possible? I am not using Maven; I want to use this in Java code. – arsenal Jul 11 '11 at 21:44
- The description of how to use Tika with Ant is just below the description of how to use it with Maven, and just above the instructions for the command-line tool. If you need some inspiration on how to embed it, I'm certain there's info on the website, and there's always the source of the command-line tool as well. – fvu Jul 11 '11 at 21:47
6
This is from lucid:
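import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
// resourceLocation is the path of the PDF file to parse
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
try (InputStream input = new FileInputStream(resourceLocation)) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());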
Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:
Parser parser = new AutoDetectParser();

Johan Haest

surajz
3
Yes, you can do that. Here is the code, which uses Apache HttpClient:
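HttpGet httpget = new HttpGet("http://url.here");
// DefaultHttpClient is deprecated since HttpClient 4.3; HttpClients.createDefault() is the newer equivalent
HttpClient client = new DefaultHttpClient();
HttpResponse response = client.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    String plainText;
    // Close the response stream even if parsing fails
    try (InputStream instream = entity.getContent()) {
        // The no-argument BodyContentHandler stops at 100,000 characters; pass -1 to remove the limit
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // AutoDetectParser chooses the concrete parser (HTML, PDF, ...) from the content
        Parser parser = new AutoDetectParser();
        parser.parse(instream, handler, metadata, new ParseContext());
        plainText = handler.toString();
    }
    try (FileWriter writer = new FileWriter("/scratch/cache/output.txt")) {
        writer.write(plainText);
    }
    System.out.println("done");
}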
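A side note on the file-writing step: `FileWriter` constructed from just a path uses the platform default charset, which on JDKs before 18 depends on the operating system settings, so non-ASCII text from a page may end up in an unexpected encoding. Below is a minimal sketch of writing the extracted text as UTF-8 with `java.nio.file.Files`. The class name `TextSaver`, the method `save`, and the file name `output.txt` are hypothetical and not part of the answer or of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextSaver {
    // Hypothetical helper: writes the text as UTF-8, creating the file
    // or replacing its contents if it already exists.
    public static void save(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by handler.toString().
        String plainText = "R\u00e9sum\u00e9 extracted from a page";
        Path target = Paths.get("output.txt"); // assumed output location
        save(target, plainText);
        System.out.println("wrote " + Files.size(target) + " bytes");
    }
}
```

In the answer's code, `TextSaver.save(Paths.get("/scratch/cache/output.txt"), plainText)` would take the place of the `FileWriter` lines.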

jeremyvillalobos
1
To extract content from a URL rather than from a local file, use this code:
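// content is assumed to be an already fetched response object whose getContent() returns the raw bytes
byte[] raw = content.getContent();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// AutoDetectParser detects the document type from the bytes themselves
Parser parser = new AutoDetectParser();
parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
// LOG is assumed to be the surrounding class's logger
LOG.info("content: " + handler.toString());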
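The snippet above does not show where the `raw` bytes come from. As a minimal sketch using only JDK classes (Java 9 or later for `InputStream.readAllBytes`), they could be fetched as follows. The names `UrlBytes` and `fetchBytes` are hypothetical, chosen for illustration, and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;

public class UrlBytes {
    // Hypothetical helper: reads everything the URL serves into memory.
    public static byte[] fetchBytes(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return in.readAllBytes(); // requires Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UrlBytes <url>");
            return;
        }
        // Accepts any URL the JDK can open, e.g. http, https, or file.
        byte[] raw = fetchBytes(URI.create(args[0]).toURL());
        System.out.println("fetched " + raw.length + " bytes");
    }
}
```

The resulting array can then be wrapped in a `ByteArrayInputStream` and handed to the parser as in the code above. Reading the whole response into memory is a simplification that suits small pages; for large documents, passing the stream straight to the parser avoids the extra copy.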

Haya aziz
- You can also use TikaInputStream.get(byte[]) to build the InputStream. – Gagravarr Feb 14 '12 at 10:13