Currently I'm using tika-app-1.16.jar to OCR my PDFs (in combination with Tesseract): java -jar tika-app-1.16.jar /tmp/testing/input.pdf

However, by default it only supports English, and I would like to find a way to pass a different language.

According to the documentation:

When using the OCR Parser Tika will use the following default settings:

  • Tesseract installation path = ""
  • Language dictionary = "eng"
  • Page Segmentation Mode = "1"
  • Minimum file size = 0
  • Maximum file size = 2147483647
  • Timeout = 120

To change these settings you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own and placing it in the package org/apache/tika/parser/ocr on your classpath.

It is worth noting that doing this when using one of the executable JARs, either the tika-app or tika-server JARs, will require you to execute them without using the -jar command. For example, something like the following for the tika-app or tika-server, respectively:

java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI

java -cp /path/to/your/classpath:/path/to/tika-server-1.7-SNAPSHOT.jar org.apache.tika.server.TikaServerCli

and

For users of the Tika App, in addition to the system property and the environment variable, you can also use the --config=[tika-config.xml] option to select a different Tika Config XML file to use

For users of the Tika Server, in addition to the system property and the environment variable, you can also use the -c [tika-config.xml] or --config [tika-config.xml] options to select a different Tika Config XML file to use

However, I have not been able to find a working example of a tika-config.xml that changes the language used by Tesseract OCR. Are there any examples available?

Gugols
  • Why not do as the first snippet says, copy the properties file, change it, then pop that in the right place on your classpath? – Gagravarr Nov 25 '17 at 17:08
  • @Gagravarr Currently I'm using the Tika app binary (not the full source install). Following the example, I added a Tesseract properties file and tried to reference it on the classpath, but it made no difference (I'm not even sure it gets picked up): java -cp tika-parser/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:tika-app-1.16.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample.pdf – Gugols Nov 29 '17 at 12:48
  • Don't specify the path to the properties on the classpath, specify the path to the root directory holding its tree, eg `tika-parser/src/main/resources` in your case – Gagravarr Nov 29 '17 at 13:21
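
The approach from the comments above could look roughly like the sketch below. It creates a small override tree, writes a properties file into it, and puts the root of that tree (not the file) on the classpath. CONF_ROOT and TIKA_JAR are hypothetical paths chosen for illustration. The `language` key is my assumption about what the stock TesseractOCRConfig.properties uses, so check it against the file shipped with your Tika version.

```shell
#!/bin/sh
# Sketch only: CONF_ROOT and TIKA_JAR are hypothetical paths; adjust them to your setup.
CONF_ROOT="${CONF_ROOT:-./tika-conf}"
TIKA_JAR="${TIKA_JAR:-./tika-app-1.16.jar}"
INPUT_PDF="${1:-/tmp/testing/input.pdf}"

# The properties file must sit in a directory tree that mirrors the package name.
mkdir -p "$CONF_ROOT/org/apache/tika/parser/ocr"

# Assumption: "language" is the key that holds the Tesseract language code.
printf 'language=rus\n' > "$CONF_ROOT/org/apache/tika/parser/ocr/TesseractOCRConfig.properties"

# Put the ROOT directory (not the file itself) on the classpath, ahead of the jar.
if [ -f "$TIKA_JAR" ]; then
    java -cp "$CONF_ROOT:$TIKA_JAR" org.apache.tika.cli.TikaCLI "$INPUT_PDF"
else
    echo "Tika jar not found at $TIKA_JAR; skipping the run" >&2
fi
```

The override directory is listed before the jar so that its copy of the file is found first. Tesseract also needs the matching language data (here `rus`) installed, otherwise the OCR step will likely fail or produce nothing.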

1 Answer

I'm using the following "crutch": replace the original tesseract binary with a bash script of the same name that rewrites the run arguments =)

My /usr/bin/tesseract file:

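#!/bin/bash
# Wrapper installed as /usr/bin/tesseract; the original binary was renamed to tesseract_ori.

args=()
for a in "$@"; do
    [ "$a" = "eng" ] && a="rus"   # replace eng => rus (only when the whole argument is "eng")
    args+=("$a")                  # an array keeps paths with spaces intact
done
export TESSDATA_PREFIX=/usr/share/tesseract/
# tesseract_ori <-- original tesseract
/usr/bin/tesseract_ori "${args[@]}" >> /tmp/tess.log 2>&1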
alxm