0

I'm using Apache Tika server in a Docker container to parse all sorts of files. I noticed that when sending a broken PDF to parse, Tika returned 200 and empty text. I added this line to my config.xml:

<returnStackTrace>true</returnStackTrace>

Which then made Tika still return 200, but with the stack trace:

org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:119)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185)
        at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)
        at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:497)
        at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1543)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
        at java.base/java.lang.Thread.run(Thread.java:833)

This is problematic, because I'm relying on the status code returned from Tika to determine if the text extraction worked or not. I would like to make Tika return 500 when it fails to parse a file, but I couldn't find any configuration value to put in config.xml to make it so.

  • Not sure what endpoint you're using. The challenge is that Tika server was initially designed to stream results, which means that it had to return a status before parsing the full file, which means that it could return 200, start streaming results and then hit a parse exception. See https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared for a table of differences in how Tika reports these problems. – Tim Allison May 03 '23 at 14:06

0 Answers0