
I have been trying to get Apache Tika to work with this Python package: https://github.com/chrismattmann/tika-python

I have the following code in my Python program:

#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('pdf/myPdf.pdf')

But I get a 422 response every time:

[MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[MainThread  ] [WARNI]  Tika server returned status: 422

Apache Tika does work when I use the following command:

java -jar tika-app-1.18.jar -t pdf/alnaggar2016lattice.pdf 

I would really like to fix this error, because getting the Tika-Python package to work would make the rest of the project much easier.
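
To narrow down where the 422 comes from, it can help to look at the status the server reports for each file. This is a minimal sketch, assuming a recent tika-python release in which the dict returned by `parser.from_file` carries a `status` key alongside `content` and `metadata`. The helper name `parse_or_report` and its `parse` parameter are hypothetical; `parse` is there only so the function can be tried with a stand-in when no Tika server is running.

```python
def parse_or_report(path, parse=None, headers=None):
    """Parse `path` and return (status, content); content is None on failure."""
    if parse is None:
        # Imported lazily so the helper can be exercised without tika installed.
        from tika import parser
        parse = parser.from_file

    # Assumption: the returned dict has a "status" key (recent tika-python).
    parsed = parse(path, headers=headers) or {}
    status = parsed.get("status")
    if status != 200:
        # 422 means the server received the file but could not process it.
        return status, None
    return status, parsed.get("content")
```

Calling `parse_or_report('pdf/myPdf.pdf')` against a working server would return the status together with the extracted text, so a failing file can be told apart from a server that never started.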

Ryan Fasching
  • The Tika Python library is a wrapper around the Tika Server jar, rather than the CLI Tika App jar. Does the Tika Server jar start properly for you? – Gagravarr Sep 06 '18 at 05:02
  • I have tried the command above with the tika jar file and it didn't give me any errors when I ran it. @Gagravarr – Ryan Fasching Sep 06 '18 at 18:20
  • Tika App != Tika Server - what happens if you try to start the *Tika Server* standalone? – Gagravarr Sep 06 '18 at 18:35
  • When I run the server jar file with this command: `java -jar tika-server-1.18.jar -h 0.0.0.0` It doesn't give me any errors and starts up fine. @Gagravarr – Ryan Fasching Sep 07 '18 at 05:45
  • Best ask on the Tika Users list, the main maintainer of the Tika Python integration tends not to be on StackOverflow much, but is on the Tika mailing lists! – Gagravarr Sep 07 '18 at 06:20
  • HTTP 422 is an interesting code, and I get it a lot when putting files through Tika that contain only images. With a PDF that holds an image, Tika knows the file type but cannot process the file, because it is an image. To get around this I pass headers: `headers = {'X-Tika-PDFextractInlineImages': 'true'}` and then `data = parser.from_file(pathtofile, serverEndpoint=self.TIKA_SERVER, headers=headers)` – Hairy Dec 05 '18 at 08:10
  • So the issue was that the PDF was corrupted – Ryan Fasching Dec 05 '18 at 22:02
  • Tika is one of the worst projects I have ever seen: missing or incorrect documentation, constant runtime errors, very slow, and unnecessarily verbose, incomprehensible source code. Think about it ten times before using it. I strongly advise against it. – Xilmiki Jul 04 '21 at 18:25
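
Since the thread settles on a corrupted PDF as the cause, a cheap pre-check before handing a file to Tika might look like the sketch below. It only tests for the `%PDF-` header near the start and the `%%EOF` marker near the end, so it is a heuristic: a file that passes can still be damaged in between. The function name `looks_like_pdf` and the 1024-byte windows are assumptions made for illustration.

```python
def looks_like_pdf(path, window=1024):
    """Heuristic: True if the file has a PDF header and an end-of-file marker."""
    with open(path, "rb") as fh:
        head = fh.read(window)            # header is expected near the start
        fh.seek(0, 2)                     # move to the end to learn the size
        size = fh.tell()
        fh.seek(max(0, size - window))    # marker is expected near the end
        tail = fh.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

Files that fail this check can be skipped or logged up front instead of being sent to the server.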

0 Answers