Tika 1.1 Performance Improvement

Asked Dec 23 '13 at 14:43

Active Dec 24 '13 at 05:29

Viewed 831 times

I am using Tika 1.1, and it is taking a long time to extract content from files. Extracting a 1 MB PDF/DOC file takes around 3 seconds. Is there any way to improve performance? Is there any tuning or configuration that would help?

I have also tried Tika 1.4, but unfortunately the same PDF takes ~3.2 seconds.

I am using BodyContentHandler.

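import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikkaExtractor {
    public static void main(String[] args) throws Exception {
        // Collect body text; a SAXException is thrown once the 10,000-character limit is reached.
        BodyContentHandler handler = new BodyContentHandler(10000);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream content = TikkaExtractor.class.getResourceAsStream("demo.pdf")) {
            parser.parse(content, handler, metadata, new ParseContext());
        }
        // BodyContentHandler.toString() already returns the extracted text.
        String s = handler.toString();
    }
}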

edited Dec 24 '13 at 05:29

asked Dec 23 '13 at 14:43

Chetan Laddha

How are you calling Tika? And how are you getting the text out? (Plain text, xhtml, custom sax handler etc?) – Gagravarr Dec 23 '13 at 23:40
That's a bit hard to read in a comment - any chance you could edit your question to include the code in a nicely formatted code block? – Gagravarr Dec 24 '13 at 05:25
Is your file protected in any way? And how many pages does it have in it? – Gagravarr Jan 02 '14 at 11:25
The file is not protected. It is around 1 MB in size. – Chetan Laddha Jan 07 '14 at 08:31
Can someone please suggest how I can improve the performance of content extraction? – Chetan Laddha Jan 13 '14 at 10:20
Try profiling it, and see where the time goes. Quite possibly you'll need to report a bug to the Apache PDFBox project, which is what Apache Tika uses internally for the PDF parts – Gagravarr Jan 13 '14 at 15:30
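Before reaching for a full profiler, one low-cost first step on the suggestion above is to time several parses inside the same JVM. A single run from main() may also be paying for JVM warm-up and one-time initialization, so the first call and later calls can differ a lot. The following is a minimal sketch using only the JDK. ParseTimer and timeRuns are hypothetical names, not part of Tika, and the stand-in workload in main() is only there so the sketch runs on its own. To measure the real thing, pass the parse call from the question as the task.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Tika): times repeated calls of a task
// so the first (cold) run can be compared with later (warm) runs.
public class ParseTimer {

    // Runs the task `runs` times and returns the elapsed milliseconds of each run.
    public static <T> long[] timeRuns(Callable<T> task, int runs) throws Exception {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; replace with a Callable that opens demo.pdf
        // and calls parser.parse(...) with a fresh handler on every run.
        Callable<Integer> task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i);
            }
            return sb.length();
        };

        long[] times = timeRuns(task, 5);
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + (i + 1) + ": " + times[i] + " ms");
        }
    }
}
```

If the first run is much slower than the rest, most of the ~3 seconds is startup cost rather than per-document parsing. If every run is about equally slow, the time is going into the parse itself, and a profiler can then show where.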


0 Answers