Currently I'm using tika-app-1.16.jar to OCR my PDFs (in combination with Tesseract): java -jar tika-app-1.16.jar /tmp/testing/input.pdf
However, by default it only supports English, and I would like to find a way to pass a different language.
According to the documentation:
When using the OCR Parser Tika will use the following default settings:
- Tesseract installation path = ""
- Language dictionary = "eng"
- Page Segmentation Mode = "1"
- Minimum file size = 0
- Maximum file size = 2147483647
- Timeout = 120
To change these settings you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own and placing it in the package org/apache/tika/parser/ocr on your classpath.
It is worth noting that doing this when using one of the executable JARs, either the tika-app or tika-server JARs, will require you to execute them without using the -jar command. For example, something like the following for the tika-app or tika-server, respectively:
java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI
java -cp /path/to/your/classpath:/path/to/tika-server-1.7-SNAPSHOT.jar org.apache.tika.server.TikaServerCli
and
For users of the Tika App, in addition to the system property and the environment variable, you can also use the --config=[tika-config.xml] option to select a different Tika Config XML file to use.
For users of the Tika Server, in addition to the system property and the environment variable, you can also use the -c [tika-config.xml] or --config [tika-config.xml] options to select a different Tika Config XML file to use.
However, I have not been able to find a working example of a tika-config.xml that changes the language used by Tesseract OCR. Are there any examples available?
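For reference, my reading of the properties-file route from the first quoted passage is roughly the following. This is only a minimal sketch, not something I have confirmed: the tika-conf directory name is hypothetical, the key name language is an assumption that should be checked against the TesseractOCRConfig.properties bundled with the Tika version in use, and fra assumes the French language data is installed for Tesseract.

```shell
# Hypothetical locations; any directory works as long as it is put on the classpath.
CONF_DIR="${CONF_DIR:-$PWD/tika-conf}"
TIKA_JAR="${TIKA_JAR:-$PWD/tika-app-1.16.jar}"

# The override file has to sit under the package path named in the documentation.
mkdir -p "$CONF_DIR/org/apache/tika/parser/ocr"
cat > "$CONF_DIR/org/apache/tika/parser/ocr/TesseractOCRConfig.properties" <<'EOF'
# Assumed key name; verify it against the properties file shipped with Tika.
language=fra
EOF

# Start the CLI with -cp instead of -jar so the override directory comes first
# on the classpath. Skipped when the jar is not present at the assumed location.
if [ -f "$TIKA_JAR" ]; then
  java -cp "$CONF_DIR:$TIKA_JAR" org.apache.tika.cli.TikaCLI /tmp/testing/input.pdf
fi
```

This mirrors the java -cp form quoted above, with the override directory placed before the JAR. It does not cover the tika-config.xml route, which is what I am actually asking about.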