I'm using Apache Tika server in a Docker container to parse all sorts of files. I noticed that when sending a broken PDF to parse, Tika returned 200
and empty text. I added this line to my config.xml
:
<returnStackTrace>true</returnStackTrace>
Which then made Tika still return 200
, but with the stack trace:
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:119)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185)
at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)
at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:497)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1543)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:833)
This is problematic, because I'm relying on the status code returned from Tika to determine if the text extraction worked or not. I would like to make Tika return 500 when it fails to parse a file, but I couldn't find any configuration value to put in config.xml
to make it so.