
I am having trouble understanding how parsers are loaded into Tika. According to the documentation, tika-app comes prepackaged with the parsers (https://tika.apache.org/1.17/gettingstarted.html). However, when I run the command below to start the server, I get the following warnings and the health check times out:

    ./.java-buildpack/open_jdk_jre/bin/java -jar ./lib/tika-app-1.24.1.jar -s --port ${PORT}

    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR for optional dependencies.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR WARNING: org.xerial's sqlite-jdbc is not loaded.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Please provide the jar on your classpath to parse sqlite files.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR See tika-parsers/pom.xml for the correct version.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] OUT Successfully started tika-app's server on port: 8080
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR WARNING: The server option in tika-app is deprecated and will be removed
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR by Tika 2.0 if not shortly after Tika 1.14.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR Please migrate to the JAX-RS tika-server package.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR See https://wiki.apache.org/tika/TikaJAXRS for usage.
    2020-11-02T13:31:25.66-0600 [HEALTH/0] ERR Failed to make HTTP request to '/version' on port 8080: timed out after 1.00 seconds
    2020-11-02T13:31:25.66-0600 [CELL/0] ERR Timed out after 1m0s: health check never passed.

I am on Tika 1.24.1, the most recent version. The documentation mentions downloading tika-server and passing a classpath at runtime that points to a tika-parsers.jar (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-ParsersMissing), but I can't find the tika-parsers.jar file anywhere. I am using openjdk-jre-1.8.0 to run this.

mlanier
  • Are you having trouble with getting content back? Or just wondering about the warnings about missing additional native dependencies for some parsers? – Gagravarr Nov 03 '20 at 12:42
  • I am not getting any content back when using python and connecting to this server. It always gives me an empty parser error. – mlanier Nov 05 '20 at 19:57

1 Answer


The parsers should be bundled by default. Tika App in server mode (-s) is a plain socket-based server, not an HTTP server. You can confirm it is working by sending it a file with netcat and seeing if you get a response:

    nc localhost 8080 -q2 < test.pdf

To use this from Python you would need to write custom code that opens a socket, sends the input, sends a SHUT_WR to signal the end of the input, and then reads the output back.
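The socket steps above can be sketched roughly as follows. This is a minimal illustration rather than a tested client for tika-app: the function name `parse_via_tika_socket`, the default host and port (taken from the netcat example), and the timeout are all assumptions.

```python
import socket


def parse_via_tika_socket(data: bytes, host: str = "localhost",
                          port: int = 8080, timeout: float = 30.0) -> bytes:
    """Send raw document bytes to a socket server and return all bytes it writes back.

    Hypothetical helper: host/port defaults mirror the netcat example.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Send the whole document.
        sock.sendall(data)
        # Half-close the connection: tells the server the input is complete
        # while keeping our side open for reading.
        sock.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read means the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Reading `test.pdf` in binary mode and passing its contents to `parse_via_tika_socket` should then return whatever the server writes back, as bytes that you can decode yourself.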

If you are using the tika-python library, it expects a Tika Server, which ships in the tika-server JAR, not the tika-app JAR. The library has some helper settings for pointing it at that JAR, or you can host your own instance (self-run or Docker) and give it the URL.
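If you host your own tika-server instance, you can also check that it returns content without going through tika-python, by calling its REST interface directly. The sketch below uses only the standard library and assumes the server listens at `http://localhost:9998` (tika-server's default port); the function name `extract_text` is hypothetical.

```python
import urllib.request


def extract_text(path: str, server_url: str = "http://localhost:9998") -> str:
    """PUT a file to a tika-server /tika endpoint and return the plain-text result.

    Hypothetical helper: server_url is an assumption, adjust it to your instance.
    """
    with open(path, "rb") as f:
        body = f.read()

    request = urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text output
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

If this returns text for a known-good file, the server and its parsers are working and any remaining problem is on the client side.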

Dave Meikle