
I'm using the Apache Tika app on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.

The [Apache Tika website][1] says the following:

Build artifacts

The Tika build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.

tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

So I downloaded the latest version (1.18) of tika-app-*.jar, which is just a single file.

Running it from the command line as java -jar tika-app-1.18.jar -t <filename> gives me the file content I need, but each time I also get two warnings:

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.

I don't know whether those warnings slow things down, but it is hard to follow the other output among those repetitive warnings.

I have tried pointing Tika to my own configuration file with:

java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>

My tika-config.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

With that config I get the error No protocol: filename.doc, and the warnings still appear.

How can I exclude the JPEG and SQLite parsers?

user164863
  • Did you read and follow http://tika.apache.org/1.18/configuring.html? – Gagravarr Jul 29 '18 at 15:22
  • @Gagravarr Thank you, no, I hadn't read that. Based on it, I am feeding the configuration file correctly. I can probably use `<mime-exclude>image/jpeg</mime-exclude>` to keep images from being parsed. I would probably need a default config file; do I still use the contents of `pom.xml`? And the SQLite parser is probably excluded the same way as images, correct? – user164863 Jul 29 '18 at 18:18
  • You only need `pom.xml` if you are compiling Tika yourself, which you don't need to do when configuring the app! – Gagravarr Jul 29 '18 at 22:50
  • @Gagravarr OK, I get it. But when I make a config file with just the first example of how parsers can be configured and then run `java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>`, I get `No protocol: filename.doc`. Also, what is the MIME type for SQLite files? – user164863 Jul 30 '18 at 08:22
  • @Gagravarr I have updated my question based on the link you gave me – user164863 Jul 30 '18 at 08:28
  • Those warnings come at initialisation time, whereas you're excluding things at parse time. You probably just want to follow http://tika.apache.org/1.18/configuring.html#Load_Error_Handling to turn off the warnings – Gagravarr Jul 30 '18 at 09:06

1 Answer


My solution was this tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader loadErrorHandler="IGNORE"/>
  <service-loader initializableProblemHandler="ignore"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

and then set:

export TIKA_CONFIG=/path/to/tika-config.xml

in my .bashrc file.
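
If the goal is only to silence the two start-up warnings, a smaller config may be enough, since (as noted in the comments on the question) the warnings are raised at initialisation time rather than at parse time. This is a minimal sketch, assuming Tika 1.18 accepts both the `loadErrorHandler` and `initializableProblemHandler` attributes on a single `service-loader` element; I have not checked this against other Tika versions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Assumption: both attributes may sit on one service-loader element. -->
  <!-- loadErrorHandler: what to do when a service class cannot be loaded. -->
  <!-- initializableProblemHandler: what to do with the "not loaded" -->
  <!-- problems reported at start-up (the warnings in the question). -->
  <service-loader loadErrorHandler="IGNORE" initializableProblemHandler="ignore"/>
</properties>
```

With no `parsers` section, the default parsers should stay enabled, so JPEG and SQLite files are still handled as before and only the warnings are suppressed.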

aarkerio