0

Hi, Tika Server is set up with Tesseract, but it still does not read embedded images in PDFs. I tried the two available headers, but they did not help.

This happens for PDF files only; OCR works for other file types and images.

I am using a customized Docker container. Oddly, the same container deployed on another machine works. Could this be a lower-level issue?

Update: After comparing logs, it seems OCP is lowercasing the custom HTTP headers, so that X-Tika... and Postman-Token arrive as x-tika... and postman-token. Can anyone help me work out what the issue could be?

S. Das
  • One of the points of Docker containers is that they come batteries-included, and run the same everywhere... Are you sure you're running the same containers on both machines, with the same environment variables passed in? – Gagravarr Mar 11 '21 at 11:22
  • Yes, though one is running on Kubernetes and the other on OCP. No extra environment variables are passed in. – S. Das Mar 11 '21 at 22:17

2 Answers

0

It seems that OCP lowercasing the custom headers is the reason for the issue. Tika Server 1.25 does not match X-Tika headers case-insensitively.

I have fixed it in Tika Server 1.26. Ref: https://tika.apache.org/1.26/index.html https://issues.apache.org/jira/browse/TIKA-3320
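To illustrate the failure mode, here is a minimal sketch of how a case-sensitive prefix check can miss headers once a proxy has lowercased their names. This is not Tika's actual code: the class and method names are hypothetical, and X-Tika-PDFOcrStrategy is used only as an illustrative header name, since the question does not say which headers were sent.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TikaHeaderMatch {
    // Prefix of the per-request configuration headers discussed above.
    static final String PREFIX = "X-Tika-";

    // Case-sensitive check: misses "x-tika-..." after a proxy lowercases it.
    public static boolean matchesStrict(String headerName) {
        return headerName.startsWith(PREFIX);
    }

    // Case-insensitive check: compares only the prefix, ignoring case.
    public static boolean matchesIgnoreCase(String headerName) {
        return headerName.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
    }

    // Simulates a proxy that lowercases every header name (assumed behaviour).
    public static Map<String, String> lowercaseNames(Map<String, String> headers) {
        Map<String, String> out = new LinkedHashMap<>();
        headers.forEach((name, value) -> out.put(name.toLowerCase(Locale.ROOT), value));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> sent = new LinkedHashMap<>();
        sent.put("X-Tika-PDFOcrStrategy", "ocr_only"); // illustrative header only
        sent.put("Postman-Token", "abc");              // placeholder value

        // Print how each check treats the names as the server would receive them.
        for (String name : lowercaseNames(sent).keySet()) {
            System.out.println(name
                    + " strict=" + matchesStrict(name)
                    + " ignoreCase=" + matchesIgnoreCase(name));
        }
    }
}
```

HTTP header names are case-insensitive by specification, so an intermediary is allowed to change their case; a server that matches them case-sensitively will silently ignore the settings, which is consistent with the same container behaving differently on Kubernetes and on OCP.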

S. Das
-1

See the PDFParserConfig API documentation: https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html

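import org.apache.tika.parser.pdf.PDFParserConfig;

PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(true);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);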

The scanned PDF document is converted to an image and then sent to Tesseract.

marek.kapowicki
  • This answer doesn't directly respond to the tika-server point, and it conflates the two strategies: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066#PDFParser(ApachePDFBox)-OCR – Tim Allison Mar 11 '21 at 19:57