I want Tika to parse only zip files and pdf files.

With the following tika_config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pkg.PackageParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser"/>
  </parsers>
</properties>

launching tika-server 1.17:

java -jar tika-1.17-src/tika-1.17/tika-server/target/tika-server-1.17.jar --config tika_config.xml -enableUnsecureFeatures -enableFileUrl

submitting a zip file composed of pdf and txt files:

curl -H "fileUrl:file:///home/[...]/mixed.zip" -X PUT http://localhost:9998/rmeta/text --header "Accept: application/json" > output.txt

I get

[
  {
    "Content-Type": "application/zip",
    "X-Parsed-By": [
      "org.apache.tika.parser.CompositeParser",
      "org.apache.tika.parser.pkg.PackageParser"
    ],
    "X-TIKA:content": "\n\n\n\n\n\n\n\nmixed-1.pdf\n\n\nmixed-2.txt\n\n\nmixed-3.pdf\n\n\nmixed-4.txt\n\n",
    "X-TIKA:parse_time_millis": "16"
  },
  {
    "Content-Length": "-1",
    "Content-Type": "application/pdf",
    "Creation-Date": "2018-01-23T21:07:49Z",
    "Last-Modified": "2018-01-23T21:07:50Z",
    "Last-Save-Date": "2018-01-23T21:07:50Z",
    "X-Parsed-By": [
      "org.apache.tika.parser.CompositeParser",
      "org.apache.tika.parser.pdf.PDFParser"
    ],
    "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nplop\n\n\n",
    "X-TIKA:embedded_resource_path": "/mixed-1.pdf",
    "X-TIKA:parse_time_millis": "6",
    "access_permission:assemble_document": "true",
    "access_permission:can_modify": "true",
    "access_permission:can_print": "true",
    "access_permission:can_print_degraded": "true",
    "access_permission:extract_content": "true",
    "access_permission:extract_for_accessibility": "true",
    "access_permission:fill_in_form": "true",
    "access_permission:modify_annotations": "true",
    "created": "Tue Jan 23 22:07:49 CET 2018",
    "date": "2018-01-23T21:07:50Z",
    "dc:format": "application/pdf; version=1.4",
    "dcterms:created": "2018-01-23T21:07:49Z",
    "dcterms:modified": "2018-01-23T21:07:50Z",
    "embeddedRelationshipId": "mixed-1.pdf",
    "meta:creation-date": "2018-01-23T21:07:49Z",
    "meta:save-date": "2018-01-23T21:07:50Z",
    "modified": "2018-01-23T21:07:50Z",
    "pdf:PDFVersion": "1.4",
    "pdf:docinfo:created": "2018-01-23T21:07:49Z",
    "pdf:docinfo:creator_tool": "Writer",
    "pdf:docinfo:producer": "LibreOffice 5.4",
    "pdf:encrypted": "false",
    "producer": "LibreOffice 5.4",
    "resourceName": "mixed-1.pdf",
    "xmp:CreatorTool": "Writer",
    "xmpTPg:NPages": "1"
  },
  {
    "Content-Length": "-1",
    "Content-Type": "text/plain",
    "Last-Modified": "2018-01-23T21:08:30Z",
    "Last-Save-Date": "2018-01-23T21:08:30Z",
    "X-Parsed-By": "org.apache.tika.server.resource.TikaResource$1",
    "X-TIKA:EXCEPTION:embedded_exception": "org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@22c07473\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:317)\n\tat org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:346)\n\tat org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:283)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)\n\tat org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:322)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:139)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:120)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)\n\tat 
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:202)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:101)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:370)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:641)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)\n\tat org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)\n\tat 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type\n\tat org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:120)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t... 47 more\n",
    "X-TIKA:embedded_resource_path": "/mixed-2.txt",
    "X-TIKA:parse_time_millis": "1",
    "date": "2018-01-23T21:08:30Z",
    "dcterms:modified": "2018-01-23T21:08:30Z",
    "embeddedRelationshipId": "mixed-2.txt",
    "meta:save-date": "2018-01-23T21:08:30Z",
    "modified": "2018-01-23T21:08:30Z",
    "resourceName": "mixed-2.txt"
  },
  {
    "Content-Length": "-1",
    "Content-Type": "application/pdf",
    "Creation-Date": "2018-01-23T21:07:49Z",
    "Last-Modified": "2018-01-23T21:07:50Z",
    "Last-Save-Date": "2018-01-23T21:07:50Z",
    "X-Parsed-By": [
      "org.apache.tika.parser.CompositeParser",
      "org.apache.tika.parser.pdf.PDFParser"
    ],
    "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nplop\n\n\n",
    "X-TIKA:embedded_resource_path": "/mixed-3.pdf",
    "X-TIKA:parse_time_millis": "5",
    "access_permission:assemble_document": "true",
    "access_permission:can_modify": "true",
    "access_permission:can_print": "true",
    "access_permission:can_print_degraded": "true",
    "access_permission:extract_content": "true",
    "access_permission:extract_for_accessibility": "true",
    "access_permission:fill_in_form": "true",
    "access_permission:modify_annotations": "true",
    "created": "Tue Jan 23 22:07:49 CET 2018",
    "date": "2018-01-23T21:07:50Z",
    "dc:format": "application/pdf; version=1.4",
    "dcterms:created": "2018-01-23T21:07:49Z",
    "dcterms:modified": "2018-01-23T21:07:50Z",
    "embeddedRelationshipId": "mixed-3.pdf",
    "meta:creation-date": "2018-01-23T21:07:49Z",
    "meta:save-date": "2018-01-23T21:07:50Z",
    "modified": "2018-01-23T21:07:50Z",
    "pdf:PDFVersion": "1.4",
    "pdf:docinfo:created": "2018-01-23T21:07:49Z",
    "pdf:docinfo:creator_tool": "Writer",
    "pdf:docinfo:producer": "LibreOffice 5.4",
    "pdf:encrypted": "false",
    "producer": "LibreOffice 5.4",
    "resourceName": "mixed-3.pdf",
    "xmp:CreatorTool": "Writer",
    "xmpTPg:NPages": "1"
  },
  {
    "Content-Length": "-1",
    "Content-Type": "text/plain",
    "Last-Modified": "2018-01-23T21:08:30Z",
    "Last-Save-Date": "2018-01-23T21:08:30Z",
    "X-Parsed-By": "org.apache.tika.server.resource.TikaResource$1",
    "X-TIKA:EXCEPTION:embedded_exception": "org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@22c07473\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:317)\n\tat org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:346)\n\tat org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:283)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)\n\tat org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:322)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:139)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:120)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)\n\tat 
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:202)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:101)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:370)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:641)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)\n\tat org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)\n\tat 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type\n\tat org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:120)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t... 47 more\n",
    "X-TIKA:embedded_resource_path": "/mixed-4.txt",
    "X-TIKA:parse_time_millis": "0",
    "date": "2018-01-23T21:08:30Z",
    "dcterms:modified": "2018-01-23T21:08:30Z",
    "embeddedRelationshipId": "mixed-4.txt",
    "meta:save-date": "2018-01-23T21:08:30Z",
    "modified": "2018-01-23T21:08:30Z",
    "resourceName": "mixed-4.txt"
  }
]

Is it normal to get exception stack traces in the meta field X-TIKA:EXCEPTION:embedded_exception for files that are not of the expected types?

Is there a way to ignore these files without raising exceptions, for example by explicitly associating EmptyParser with all the possible types?
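
For illustration, the EmptyParser idea could look like the sketch below. It assumes that the `<mime>` child element of `<parser>`, as described in the Tika parser configuration documentation, can bind EmptyParser to a given type. Only text/plain is listed because that is the type failing in the output above. Whether this suppresses the embedded exception in tika-server 1.17 has not been verified here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- The two parsers that are actually wanted -->
    <parser class="org.apache.tika.parser.pkg.PackageParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser"/>
    <!-- Assumption: EmptyParser bound to text/plain so that .txt entries
         are handled silently instead of falling through to the server's
         fallback parser -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
</properties>
```

The drawback is that every unwanted type would need its own `<mime>` line, which is why a catch-all default would be preferable.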

Update: I moved the 1st issue to a separate question, "Define a MIME type for .TXT files for Tika", and clarified the 2nd issue here, adding its log.

mbl
  • I won't bear you a grudge if you downvote me. But please leave a comment to say how I can improve the question, or why it does not have its place on SO. – mbl Jan 21 '18 at 22:23
  • Why not have a Tika Config which uses Default Parser as the main one, and excludes any other parsers you don't want? Or swap it round and only list the parsers for the formats you do want, if that's easier? See https://tika.apache.org/1.17/configuring.html#Configuring_Parsers – Gagravarr Jan 21 '18 at 22:43
  • Thanks for your answer @Gagravarr! There is an endless list of parsers I don't want (vs a short list of parsers I do want), so the 1st option is not practical. I did not manage to make the 2nd option work. Maybe I'm doing something wrong. Could you provide a sample config please? – mbl Jan 22 '18 at 09:28
  • Start with something like https://github.com/apache/tika/blob/master/tika-core/src/test/resources/org/apache/tika/config/TIKA-866-valid.xml and just list all the parsers you want, in their own ` – Gagravarr Jan 22 '18 at 11:15
  • But how can I use EmptyParser by default? It is the same question as https://stackoverflow.com/questions/6804025/how-to-properly-configure-apache-tika-for-a-few-document-types?rq=1 but the answer dates back to 2011 and seems no longer relevant. – mbl Jan 22 '18 at 23:39
  • You don't use `EmptyParser`! You use the parsers you actually want. That's just showing you where to list your parsers! – Gagravarr Jan 23 '18 at 15:36
  • But when there is no parser for a given mime type, this raises an exception (and if I'm not mistaken this stops the processing of the whole package). How can I prevent that? – mbl Jan 23 '18 at 15:51
  • It shouldn't do, if there's no parser for a mimetype (or its parents) then any files of that type will be ignored – Gagravarr Jan 23 '18 at 15:53
  • I updated the question to make it more focused and clear. – mbl Jan 23 '18 at 21:54

0 Answers