
I'm having some trouble with Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, so each page is a single image. My goal is to extract the text from these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (imports omitted; checked exceptions are simply declared rather than handled):

public String extractText(InputStream stream)
        throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}

I searched a lot but didn't find a solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but it didn't change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a .doc file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

LorisBachert
  • Did you attach a `PDFParserConfig` to the context with that option set? – Gagravarr Sep 02 '15 at 17:13
  • Yes, I did. But it had no effect :/ – LorisBachert Sep 03 '15 at 05:16
  • Can you post the code you used to do that, so we can check if it's correct? – Gagravarr Sep 03 '15 at 08:02
  • `PDFParserConfig config = new PDFParserConfig();` `config.setExtractInlineImages(true);` `ParseContext context = new ParseContext();` `context.set(PDFParserConfig.class, config);` `PDFParser pdfParser = new PDFParser();` `pdfParser.setPDFParserConfig(config);` `pdfParser.parse(stream, handler, metadata, context);` There you go, thanks for the help so far :) – LorisBachert Sep 03 '15 at 08:53
  • Does running the Tika App with the `-z` (extract) flag get the scanned images out of the file? – Gagravarr Sep 03 '15 at 09:16
  • Sadly, it doesn't. BTW: I'm using the PDF mentioned in the Tika ticket about OCR of embedded images, which you can find here: [Ticket](https://issues.apache.org/jira/browse/TIKA-93), [PDF](https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf) – LorisBachert Sep 03 '15 at 09:26
  • I'd suggest you raise a new Tika JIRA then, and refer to that file + what you've tried + a unit test that shows the issue. You seem to have done everything that I'd expect you to need to have done! – Gagravarr Sep 03 '15 at 09:30
  • I created a ticket in the official Apache Tika JIRA. Anyone interested in updates can take a look [here](https://issues.apache.org/jira/browse/TIKA-1729). – LorisBachert Sep 03 '15 at 09:54
  • Does it work for you without Tesseract installed? – Bilal BBB Sep 12 '15 at 16:11
  • No, it needs Tesseract. – LorisBachert Sep 13 '15 at 18:47
  • It would be better to write your solution here so everybody can use it. – Bilal BBB Sep 14 '15 at 09:57

1 Answer


Tim Allison provided the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)
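
The comment about recursive parsing in the snippet above may be easier to follow with a toy model. As far as I understand Tika 1.x, the handler for embedded documents asks the ParseContext for a Parser and falls back to a parser that produces nothing if none is registered, so the page images inside the PDF are never handed to the OCR parser. The sketch below is only an illustration of that lookup pattern: `ToyParser`, `ToyContext`, `EMPTY` and `parseEmbedded` are hypothetical names made up for this example and are not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these types are Tika's, they just mimic the lookup pattern. */
public class RecursiveParsingModel {

    /** Hypothetical stand-in for a parser: turns some input into text. */
    interface ToyParser {
        String parse(String input);
    }

    /** Hypothetical stand-in for a parse context: maps a type to one instance of it. */
    static class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Fallback parser that silently produces no text. */
    static final ToyParser EMPTY = input -> "";

    /** Handles an embedded image by delegating to whatever parser the context holds. */
    static String parseEmbedded(String image, ToyContext context) {
        ToyParser delegate = context.get(ToyParser.class, EMPTY);
        return delegate.parse(image);
    }

    public static void main(String[] args) {
        ToyParser ocr = input -> "text from " + input;

        // No parser registered: the embedded image yields an empty result, with no error.
        ToyContext withoutParser = new ToyContext();
        System.out.println("[" + parseEmbedded("page1.png", withoutParser) + "]");

        // Parser registered: the embedded image is actually processed.
        ToyContext withParser = new ToyContext();
        withParser.set(ToyParser.class, ocr);
        System.out.println("[" + parseEmbedded("page1.png", withParser) + "]");
    }
}
```

In this model the first call prints `[]` and the second prints `[text from page1.png]`, which mirrors the symptom in the question: no exception, just missing text until the parser is registered in the context.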

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the file even if parsing fails
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven Dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>

Thamme Gowda
LorisBachert
  • I have tried the solution and followed the Apache Tika JIRA, but it's not working. I am not getting any error, but the output is empty. – Rana Sep 23 '16 at 12:00
  • My issue got solved. Follow: http://stackoverflow.com/questions/39762841/unable-to-extract-scanned-pdf-using-tesseractocrconfig-apache-tika/39792337#39792337 – Rana Sep 30 '16 at 13:12
  • Thamme, thank you for this. Please update to include the following dependency (thanks to Rana's link above) and a warning about the licensing implications of levigo and jai: groupId `com.github.jai-imageio`, artifactId `jai-imageio-core`, version `1.3.1`. – Tim Allison Mar 27 '17 at 16:44
  • Hi, I used the code above and found no difference in the extraction result whether I include Tesseract or not. Can you tell me why Tesseract is being used? Thanks in advance. – charmi Jun 15 '17 at 11:18
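
One possible reason for seeing no difference with or without Tesseract is that the `tesseract` binary is not found by the Java process, in which case (as far as I know) Tika skips OCR without raising an error. A minimal sketch for checking this with the JDK alone is below. The class name `TesseractCheck` and the method `isRunnable` are made up for this example, and it assumes Java 9 or later (for `ProcessBuilder.Redirect.DISCARD`) and that the binary is called `tesseract` and is expected on the PATH.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Tika: checks whether a command can be started. */
public class TesseractCheck {

    /** Returns true if "<command> --version" starts and exits with code 0 within 10 seconds. */
    static boolean isRunnable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // the output is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically means the command was not found or is not executable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract runnable: " + isRunnable("tesseract"));
    }
}
```

If this reports `false` from the same environment that runs the Tika code, the OCR step is probably not happening at all, which would explain identical output in both cases.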