7

Is it possible to extract text from URLs with Tika? Any links will be appreciated. Or TIKA is usable only for pdf, word and any other media documents?

palacsint
  • 28,416
  • 10
  • 82
  • 109
arsenal
  • 23,366
  • 85
  • 225
  • 331

4 Answers4

7

Check the documentation - yes you can.

Example

java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika

will show you the text on this page.

fvu
  • 32,488
  • 6
  • 61
  • 79
  • And if I need to use this in a Java code and save the text from url in a text file.. Then it is also possible..?? And I am not using maven. I want to use this in java code.. – arsenal Jul 11 '11 at 21:44
  • 1
    the description how to use tika with ant is just below the description of how to use it with Maven, and just above the instructions for the command line tool. If you need some inspiration on how to embed it, I'm certain there's info on the website, and there's always the source of the command line tool as well. – fvu Jul 11 '11 at 21:47
6

This is from lucid:

InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());

Instead of creating a PDFParser you can use Tika's AutoDetectParser to automatically process diff types of files:

Parser parser = new AutoDetectParser();
Johan Haest
  • 4,391
  • 28
  • 37
surajz
  • 3,471
  • 3
  • 32
  • 38
3

Yes, you can do that. Here is the code. This code uses apache http client

HttpGet httpget = new HttpGet("http://url.here"); 
    HttpEntity entity = null;
    HttpClient client = new DefaultHttpClient();
    HttpResponse response = client.execute(httpget);
    entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        parser.parse( instream, handler, metadata, new ParseContext());
        String plainText = handler.toString();
        FileWriter writer = new FileWriter( "/scratch/cache/output.txt");
        writer.write( plainText );
        writer.close();
        System.out.println( "done");
    }
jeremyvillalobos
  • 1,795
  • 2
  • 19
  • 39
1

to extract content from URL not from local file use this code:

    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    LOG.info("content: " + handler.toString());
Haya aziz
  • 300
  • 1
  • 3
  • 16