Extract the text from URLs using TIKA

Question

Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other media documents?

score 7 · Answer 1 · answered Jul 11 '11 at 21:40

Check the documentation - yes you can.

Example

java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika

will show you the text on this page.

fvu


And what if I need to use this in Java code and save the text from the URL to a text file? Is that also possible? I am not using Maven. – arsenal Jul 11 '11 at 21:44

The description of how to use Tika with Ant is just below the description of how to use it with Maven, and just above the instructions for the command-line tool. If you need some inspiration on how to embed it, I'm certain there's info on the website, and there's always the source of the command-line tool as well. – fvu Jul 11 '11 at 21:47
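
For the "Java code, save to a text file" case raised in the comments, a minimal sketch using Tika's `Tika` facade class might look like the following. It assumes the Tika jars (core plus parsers) are on the classpath; `UrlTextSaver` and `saveText` are hypothetical names chosen for illustration, and the URL and output path are taken from the command line.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class for illustration; not part of Tika.
class UrlTextSaver {

    // Lets Tika open the URL, detect the content type, and extract plain
    // text, then writes that text to `target` as UTF-8 and returns it.
    static String saveText(URL url, Path target) throws IOException, TikaException {
        Tika tika = new Tika();
        // The facade caps the length of the returned string by default;
        // -1 removes the cap.
        tika.setMaxStringLength(-1);
        String text = tika.parseToString(url);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return text;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: URL to fetch, args[1]: output file (both supplied by the caller)
        saveText(new URL(args[0]), Paths.get(args[1]));
        System.out.println("done");
    }
}
```

Without Maven, this would be compiled and run with the Tika jars passed via `-cp`, for example the same `tika-app` jar used on the command line above.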

score 6 · Accepted Answer · edited Feb 05 '13 at 15:26

This is from lucid:

// resourceLocation is the path of a local PDF file
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
// The three-argument parse(...) is deprecated; pass a ParseContext as well
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());

Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:

Parser parser = new AutoDetectParser();

score 3 · Answer 3 · answered Mar 25 '12 at 20:40

Yes, you can do that. Here is the code, which uses Apache HttpClient:

    // Fetch the page with Apache HttpClient
    HttpClient client = new DefaultHttpClient();
    HttpGet httpget = new HttpGet("http://url.here");
    HttpResponse response = client.execute(httpget);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        // try-with-resources closes the stream and the writer even if parsing fails
        try (InputStream instream = entity.getContent();
             FileWriter writer = new FileWriter("/scratch/cache/output.txt")) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // AutoDetectParser picks the right parser for the content type
            Parser parser = new AutoDetectParser();
            parser.parse(instream, handler, metadata, new ParseContext());
            String plainText = handler.toString();
            writer.write(plainText);
        }
        System.out.println("done");
    }

score 1 · Answer 4 · answered Feb 14 '12 at 06:52

To extract content from a URL rather than a local file, use this code:

    // content is an already-fetched response object (its type is not shown
    // in this answer) whose getContent() returns the raw bytes
    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    LOG.info("content: " + handler.toString());
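
The snippet above does not show where the `content` object comes from. Assuming all that is needed is the raw bytes behind a URL, a plain-JDK sketch could look like this; `UrlBytes` and `fetch` are hypothetical names, not part of Tika or of the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper for illustration: reads everything behind a URL into memory.
class UrlBytes {

    static byte[] fetch(URL url) throws IOException {
        // try-with-resources closes the stream whether or not reading succeeds
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: URL to fetch (supplied by the caller)
        byte[] raw = fetch(new URL(args[0]));
        System.out.println(raw.length + " bytes read");
    }
}
```

The returned array can stand in for `raw` in the parsing code above. Holding the whole response in memory is a simplification that suits pages and small documents; for large files, passing the stream to the parser directly avoids the extra copy.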

Haya aziz


You can also use TikaInputStream.get(byte[]) to build the InputStream – Gagravarr Feb 14 '12 at 10:13
