
I'm trying to set up Tika for text extraction using Python. I've installed the Java runtime (JRE 1.8.0), installed tika with `pip install tika==1.23`, and downloaded the Tika server jar file from this link. As mentioned on this page, I've added the variable `TIKA_SERVER_JAR="..tika-server-1.9.jar"` to the system environment variables. I started the Tika server with the command `java -jar "..tika-server-1.9.jar"` and got the output below:

C:\Users\Administrator>java -jar "C:\Program Files\Java\tika-server-1.9.jar"
Mar 02, 2021 4:29:07 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.9 server
Mar 02, 2021 4:29:08 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Mar 02, 2021 4:29:08 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Mar 02, 2021 4:29:08 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Mar 02, 2021 4:29:08 PM org.apache.tika.server.TikaServerCli main
INFO: Started

When I open http://localhost:9998/ in the browser it shows me the Tika API Documentation.

Then I attempt to extract text with Python as shown below:

import tika
from tika import parser
tika.initVM()

text = parser.from_file(r"..somefile.doc")
print(text)

Tika doesn't work as intended. Instead of returning the extracted text, it logs the warning below, and I see nothing else on the console.

2021-03-02 16:31:03,037 [MainThread  ] [WARNI]  Tika server returned status: 404

I used Tika with Python successfully a few months ago, and I can't tell what I'm missing now.

EDITED: When I run the Python snippet above, I see verbose output like the following in the console.

Mar 03, 2021 9:37:08 AM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path "/rmeta/text" is found, Relative Path: /text, HTTP Method: PUT, ContentType: */*, Accept: application/json,. Please enable FINE/TRACE log level for more details.
Mar 03, 2021 9:37:08 AM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 404 Not Found
    at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
    at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:166)
    at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:526)
    at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
    at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)

This is what I see on the console every time I run the python script to extract text.

Venkatesh Dharavath
  • Why are you starting such an old version of the Apache Tika Server jar? What happens when you fix your `TIKA_SERVER_JAR` variable to refer to a recent one? – Gagravarr Mar 02 '21 at 19:15
  • @Gagravarr I'm facing the same issue with updated versions too, I tried with tika-server-1.9 also. – Venkatesh Dharavath Mar 03 '21 at 09:19
  • Apache Tika 1.9 was released in 2015! Try something a little bit more modern... – Gagravarr Mar 03 '21 at 12:42
  • I went straight to the official [page](https://tika.apache.org/download.html) of Apache tika, it shows latest stable version is tika-1.25. – Venkatesh Dharavath Mar 03 '21 at 14:02
  • So use that then! Stop using 7+ year old versions of the software and being surprised there are issues... – Gagravarr Mar 03 '21 at 17:50
  • Three months ago I used the same jar file `tika-server-1.9.jar` and it worked for me. And I tried with multiple versions of the jar file, but still, I get the same problem. Please go through the question once again, I've edited the question a bit just in case you get any idea what I'm missing. – Venkatesh Dharavath Mar 04 '21 at 10:49
  • You need to use matching versions of the Tika Server jar and the Tika python wrapper. The latest version is very much recommended! You are seemingly trying to use a very recent version of the Python wrapper to talk to a 7 year old version of the Server, which is unlikely to work as 7 years ago the server hadn't had many of the endpoints added... – Gagravarr Mar 04 '21 at 12:25
  • Okay I noted that point, and please check the tika's official page, it shows the latest stable version as `tika-server-1.25` which is older than `tika-server-1.9`. – Venkatesh Dharavath Mar 04 '21 at 14:56
  • 9 < 25, Apache Tika 1.9 = 1.09 was released in 2015, Apache Tika 1.25 (25th subrelease of 1) was released very recently – Gagravarr Mar 04 '21 at 17:03
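
The version mix-up in the comments above can be checked mechanically. The following is a minimal sketch, not part of the original thread: the helper names `parse_tika_version` and `fetch_server_banner` are hypothetical, and `fetch_server_banner` assumes the running server answers a GET on `/version` with a banner such as "Apache Tika 1.9". It shows that version numbers compare component by component, so 1.9 sorts before 1.25.

```python
import re
import urllib.request


def parse_tika_version(banner):
    """Extract version components as integers, e.g. 'Apache Tika 1.9' -> (1, 9)."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        raise ValueError(f"no version number found in {banner!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def fetch_server_banner(base_url="http://localhost:9998"):
    """Ask a running Tika server for its version banner.

    Assumption: the server exposes a /version endpoint that returns plain text.
    """
    url = base_url.rstrip("/") + "/version"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


# Versions are compared component by component, not as decimal fractions.
old = parse_tika_version("Apache Tika 1.9")
new = parse_tika_version("Apache Tika 1.25")
print(old < new)  # True: (1, 9) sorts before (1, 25)
```

With the server from the question running, `parse_tika_version(fetch_server_banner())` could then be compared against the version pinned for the Python wrapper (`tika==1.23` here) to spot a mismatch before calling `parser.from_file`.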
