Tika Server not reading embedded images in PDFs

Question

Hi, Tika Server is set up with Tesseract, but it still does not read embedded images in PDFs. I tried the two available headers, but that did not help.
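
For readers reproducing this, a minimal sketch of such a request is shown here. The question does not name the two headers; `X-Tika-PDFextractInlineImages` and `X-Tika-PDFOcrStrategy` are assumed, as are the server address `http://localhost:9998`, the class name, and the `buildRequest` helper. Each header is passed in separately so that either one can be tried on its own.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class TikaPdfOcrRequest {
    // Hypothetical helper: builds a PUT request for tika-server's /tika
    // endpoint carrying one X-Tika-PDF* header (name and value are supplied
    // by the caller; the header names used in main are assumptions).
    static HttpRequest buildRequest(String serverUrl, byte[] pdfBytes,
                                    String headerName, String headerValue) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/tika"))
                .header("Accept", "text/plain")
                .header("Content-Type", "application/pdf")
                .header(headerName, headerValue)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a PDF containing embedded images.
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        // Assumed server address and header; swap in
        // "X-Tika-PDFextractInlineImages", "true" to try the other header.
        HttpRequest request = buildRequest(
                "http://localhost:9998", pdf, "X-Tika-PDFOcrStrategy", "ocr_only");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Comparing the header names the server logs against the names sent by the client is a quick way to see whether something in between rewrites them.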

This happens for PDF files only; OCR works for other file types and images.

I am using a customized Docker container. Oddly, the same container deployed on another machine works. Could there be a lower-level issue?

Update: After comparing logs, it seems OCP is lowercasing the custom HTTP headers, so X-Tika... and Postman-Token arrive as x-tika... and postman-token. Can anyone help me work out what the issue could be?

One of the points of Docker containers is that they come batteries-included, and run the same everywhere.... Are you sure you're running the same containers on both machines, with the same environment variables passed in? — Gagravarr, Mar 11 '21 at 11:22
Yes. Though one is running on Kubernetes, and one in OCP. And no extra environment variable. — S. Das, Mar 11 '21 at 22:17

score 0 · Accepted Answer · answered Mar 30 '21 at 06:18

It seems that OCP lowercasing the custom headers is the reason for the issue. Tika Server 1.25 does not match X-Tika headers case-insensitively, so the lowercased x-tika... headers are ignored.

I have fixed it in Tika Server 1.26. Ref: https://tika.apache.org/1.26/index.html https://issues.apache.org/jira/browse/TIKA-3320
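
To illustrate the failure mode, here is a minimal sketch of case-sensitive versus case-insensitive header prefix matching. It is not the actual TIKA-3320 patch: the class name, method names, and the `X-Tika-PDF` prefix constant are hypothetical and chosen only for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HeaderPrefixMatch {
    // Assumed prefix for PDF-related settings; illustrative only.
    static final String PREFIX = "X-Tika-PDF";

    // Case-sensitive check: a proxy-lowercased "x-tika-pdf..." header is missed.
    static boolean matchesCaseSensitive(String headerName) {
        return headerName.startsWith(PREFIX);
    }

    // Case-insensitive check: accepts the header however it is cased.
    static boolean matchesIgnoreCase(String headerName) {
        return headerName.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
    }

    // Collects the settings carried by matching headers, keyed by the
    // lowercased remainder of the header name after the prefix.
    static Map<String, String> pdfSettings(Map<String, String> headers) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (matchesIgnoreCase(e.getKey())) {
                String key = e.getKey().substring(PREFIX.length()).toLowerCase(Locale.ROOT);
                settings.put(key, e.getValue());
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        String lowercased = "x-tika-pdfocrstrategy";
        System.out.println(matchesCaseSensitive(lowercased));
        System.out.println(matchesIgnoreCase(lowercased));
    }
}
```

HTTP header field names are case-insensitive by specification, so an intermediary that lowercases them is behaving legally; the receiving side has to compare without regard to case.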

score -1 · Answer 2 · answered Mar 11 '21 at 14:10

Check the PDFParserConfig documentation: https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.html

PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(true);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

The scanned PDF document is converted to an image and then sent to Tesseract.

marek.kapowicki

This answer doesn't directly respond to the tika-server point, and it conflates the two strategies: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066#PDFParser(ApachePDFBox)-OCR – Tim Allison Mar 11 '21 at 19:57
