I am using the Apache Tika parser to extract text from PDF files. Some of the PDFs contain scanned documents, and Apache Tika uses Tesseract to recognize text in images. However, there is no jar library that bundles Tesseract, so the user has to install Tesseract as a separate application in the operating system. How can I use Tesseract from Apache Tika without installing Tesseract? I tried adding a tesseract folder to the classpath and configuring it as below:

TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("tesseract");
config.setTessdataPath("tesseract/tessdata");

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, pdfConfig);

but I got:

org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 43 common frames omitted
Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:321)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
    ... 49 common frames omitted

I also tried the tess4j library, which consumes a File, but I need to parse from an InputStream without caching to the hard drive. Could anyone please help me configure Apache Tika and Tesseract?

Nox
  • Tess4J can consume `BufferedImage`, so from your `InputStream`: `BufferedImage image = ImageIO.read(inputStream);` – nguyenq Sep 16 '17 at 16:57
  • @nguyenq good idea, thanks, but I got `java.lang.IllegalArgumentException: image == null` for the pdf file – Nox Sep 18 '17 at 14:07
  • PDF is a document format, not an image. You'll need to convert it first. – nguyenq Sep 18 '17 at 17:13

1 Answer

Under the hood, Tika uses Google's Tesseract to OCR the text. tess4j does not help with this, because it is also just a wrapper around Tesseract. There are two possible solutions:

  1. Install Tesseract on the machine.
  2. Use TikaServer (https://cwiki.apache.org/confluence/display/TIKA/TikaServer). Tesseract then only has to be present on the server side (a Docker image of the server can bundle it), and the server exposes a REST API for OCR.

If you want, you can check my sample Java project that uses the Tika server: https://github.com/marekkapowicki/nlp
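
Calling the server from Java could look roughly like the sketch below. It is not taken from the linked project. It assumes Java 11+ (for `java.net.http`), a Tika server reachable at the base URL you pass in (the default port is 9998), and that your server version honours the `X-Tika-PDFOcrStrategy` header; check the TikaServer page for the headers your version supports. The class name `TikaServerOcrClient` is made up for illustration. The PDF is streamed straight from the `InputStream`, so nothing is cached on disk on the client side.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: sends a PDF to a running Tika server and returns plain text.
public class TikaServerOcrClient {

    // HTTP/1.1 keeps the streamed upload simple (no cleartext HTTP/2 upgrade attempt).
    private final HttpClient http = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_1_1)
            .build();
    private final URI tikaEndpoint;

    public TikaServerOcrClient(String baseUrl) {
        // "/tika" is the text extraction endpoint of the Tika server.
        this.tikaEndpoint = URI.create(baseUrl + "/tika");
    }

    public String extractText(InputStream pdf) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(tikaEndpoint)
                .header("Content-Type", "application/pdf")
                .header("Accept", "text/plain")
                // Assumption: the server maps this header onto the PDF parser's OCR strategy.
                .header("X-Tika-PDFOcrStrategy", "ocr_and_text_extraction")
                // The supplier hands out the same stream every time,
                // so this request cannot be replayed after a failure.
                .PUT(HttpRequest.BodyPublishers.ofInputStream(() -> pdf))
                .build();

        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

With a server running locally, usage would be along the lines of `new TikaServerOcrClient("http://localhost:9998").extractText(inputStream)`. OCR still happens in Tesseract, only on the server host instead of inside your application.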

marek.kapowicki