How to change the language parameter that Tika passes to Tesseract OCR?

Question

Currently I'm using tika-app-1.16.jar, in combination with Tesseract, to OCR my PDFs: java -jar tika-app-1.16.jar /tmp/testing/input.pdf

However, by default it only supports English, and I would like to find a way to pass a different language.

According to the documentation:

When using the OCR Parser Tika will use the following default settings:

Tesseract installation path = ""

Language dictionary = "eng"

Page Segmentation Mode = "1"

Minimum file size = 0

Maximum file size = 2147483647

Timeout = 120

To change these settings you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own and placing it in the package org/apache/tika/parser/ocr on your classpath.

It is worth noting that doing this when using one of the executable JARs, either the tika-app or tika-server JARs, will require you to execute them without using the -jar command. For example, something like the following for the tika-app or tika-server, respectively:

java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI

java -cp /path/to/your/classpath:/path/to/tika-server-1.7-SNAPSHOT.jar org.apache.tika.server.TikaServerCli

and

For users of the Tika App, in addition to the system property and the environment variable, you can also use the --config=[tika-config.xml] option to select a different Tika Config XML file to use

For users of the Tika Server, in addition to the system property and the environment variable, you can also use the -c [tika-config.xml] or --config [tika-config.xml] options to select a different Tika Config XML file to use

However, I have not been able to find a working example of a tika-config.xml that changes the language used by Tesseract OCR. Are there any examples available?

Why not do as the first snippet says, copy the properties file, change it, then pop that in the right place on your classpath? — Gagravarr, Nov 25 '17 at 17:08
@Gagravarr Currently I'm using the Tika app binary (not the full source install). Following the example, I added a Tesseract.properties file and tried to reference it on the classpath, but it made no difference (I'm not sure it even gets registered): java -cp tika-parser/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:tika-app-1.16.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample.pdf — Gugols, Nov 29 '17 at 12:48
Don't put the path to the properties file itself on the classpath; put the root directory that holds its package tree there, e.g. `tika-parser/src/main/resources` in your case — Gagravarr, Nov 29 '17 at 13:21
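
The advice in these comments can be sketched as a short shell script. This is a minimal sketch under a few assumptions: the variable names (CONF_ROOT, TIKA_JAR, INPUT_PDF, OCR_LANG) are illustrative and not Tika options, the property key is assumed to be `language` as in the bundled TesseractOCRConfig.properties, and keys left out of the override are assumed to keep their defaults. The `rus` value is only an example and needs the matching Tesseract language data installed.

```sh
#!/bin/sh
# Sketch: build a classpath root that overrides Tika's bundled OCR settings.
# All four variables are illustrative names, not Tika options.
CONF_ROOT="${CONF_ROOT:-./tika-conf}"
TIKA_JAR="${TIKA_JAR:-./tika-app-1.16.jar}"
INPUT_PDF="${INPUT_PDF:-/tmp/testing/input.pdf}"
OCR_LANG="${OCR_LANG:-rus}"

# The override must sit under the package path org/apache/tika/parser/ocr.
mkdir -p "$CONF_ROOT/org/apache/tika/parser/ocr"
# Assumption: the key is called "language", as in the bundled properties file.
cat > "$CONF_ROOT/org/apache/tika/parser/ocr/TesseractOCRConfig.properties" <<EOF
language=$OCR_LANG
EOF

# Put the root directory (not the file) on the classpath, ahead of the jar.
if [ -f "$TIKA_JAR" ]; then
  java -cp "$CONF_ROOT:$TIKA_JAR" org.apache.tika.cli.TikaCLI "$INPUT_PDF"
else
  echo "Tika jar not found at $TIKA_JAR; skipping the OCR run" >&2
fi
```

If the key names are in doubt, the bundled file can be read straight out of the jar with `unzip -p tika-app-1.16.jar org/apache/tika/parser/ocr/TesseractOCRConfig.properties` and used as the starting point for the override.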

score 1 · Answer 1 · answered Jul 01 '18 at 11:47

I use the following "crutch": I replaced the original tesseract binary with a bash script of the same name that rewrites the arguments it is run with =)

My /usr/bin/tesseract file:

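#!/bin/bash
# Installed as /usr/bin/tesseract; the original binary was renamed to tesseract_ori.
# bash is required: the ${.../eng/rus} substitution is not available in plain sh.

export TESSDATA_PREFIX=/usr/share/tesseract/
# Replace the first "eng" in each argument with "rus" (eng => rus), keeping arguments quoted.
/usr/bin/tesseract_ori "${@/eng/rus}" >> /tmp/tess.log 2>&1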
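
The wrapper above replaces the text eng wherever it occurs in the arguments, so a file path containing those letters could be altered too. A narrower variant, sketched here under the assumption that Tika passes the language as the value following `-l`, rewrites only that one value. The function name `with_lang` and the variable OCR_LANG are hypothetical, and the code sticks to POSIX sh so it does not depend on bash.

```sh
#!/bin/sh
# Sketch: swap only the value that follows -l, leaving all other arguments alone.
# OCR_LANG and with_lang are illustrative names.
OCR_LANG="${OCR_LANG:-rus}"

# Run "$1" with the remaining arguments, replacing the value after -l.
# Uses global variables because POSIX sh has no "local".
with_lang() {
  real=$1
  shift
  n=$#
  prev=""
  # Rotate through the arguments: take one off the front, append it
  # (or its replacement) to the back, exactly $n times.
  while [ "$n" -gt 0 ]; do
    arg=$1
    shift
    if [ "$prev" = "-l" ]; then
      set -- "$@" "$OCR_LANG"
    else
      set -- "$@" "$arg"
    fi
    prev=$arg
    n=$((n - 1))
  done
  "$real" "$@"
}

# Demo with echo standing in for the real binary.
with_lang echo /tmp/english.png /tmp/out -l eng -psm 1
# In an actual wrapper the last line would instead be something like:
#   with_lang /usr/bin/tesseract_ori "$@" >> /tmp/tess.log 2>&1
```

With the default OCR_LANG, the demo prints `/tmp/english.png /tmp/out -l rus -psm 1`: the path keeps its "eng" and only the language value changes.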
