I am using Tika 1.1, and extracting content from files is slow: a 1 MB PDF or DOC file takes about 3 seconds. Is there any way to improve performance? Is there any tuning or configuration that would help?
I also tried Tika 1.4, but the same PDF takes about 3.2 seconds.
I am using BodyContentHandler:
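import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikkaExtractor {
    public static void main(String[] args) throws Exception {
        // Body text handler with a write limit of 10000 characters
        BodyContentHandler handler = new BodyContentHandler(10000);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream content = TikkaExtractor.class.getResourceAsStream("demo.pdf")) {
            parser.parse(content, handler, metadata, new ParseContext());
        }
        // The handler itself holds the extracted text
        String s = handler.toString();
    }
}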
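
To narrow down where the time goes, it may help to time several parses inside the same JVM, since the first run of any Java code usually also pays for class loading and other one-time initialization. Below is a minimal timing sketch using only the JDK. The names ParseTimer and timeRuns are hypothetical helpers I made up for this, not part of Tika, and the sleep in main is only a stand-in for the real parse call.

```java
import java.util.concurrent.Callable;

public class ParseTimer {
    /**
     * Runs the task the given number of times and returns the elapsed
     * wall-clock time of each run in milliseconds, in run order.
     */
    public static long[] timeRuns(Callable<?> task, int runs) throws Exception {
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption): replace the sleep with the
        // parser.parse(...) call from the code above.
        long[] times = timeRuns(() -> {
            Thread.sleep(5);
            return null;
        }, 3);
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + (i + 1) + ": " + times[i] + " ms");
        }
    }
}
```

If the first run is much slower than the later ones, most of the ~3 seconds would be startup cost rather than per-document cost, which would change what kind of tuning is worth looking for.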