How can I run tesseract with multiple languages at once?

Question

I have to analyze an image that contains both English and Japanese text. When I run tesseract with the default language (-l eng), some Japanese characters are lost. Conversely, if I run tesseract with Japanese (-l jpn), some English text is lost (e.g. "Email").

How can I run a single process that recognizes both English and Japanese characters?

hope this will help: https://github.com/rmtheis/tess-two/issues/28 — David V, Jun 24 '14 at 09:28
See https://stackoverflow.com/questions/16508796/how-can-i-use-multiple-language-support-on-android-with-tesseract — sashoalm, Dec 22 '14 at 12:40

score 53 · Accepted Answer · answered Dec 22 '14 at 12:36


Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter.

-l lang The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes.

An example:

tesseract myscan.png out -l deu+eng
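For the English and Japanese case in the question, the same call can be put together from a script. The following is only a minimal sketch: the helper names build_tesseract_command and run_tesseract are made up for illustration, and it assumes the tesseract binary is on your PATH with the eng and jpn traineddata files installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argv list for a multi-language tesseract run (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Multiple languages are joined with '+', e.g. ['eng', 'jpn'] -> 'eng+jpn'
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def run_tesseract(image_path, output_base, languages):
    """Run tesseract; assumes the binary and the language data are installed."""
    cmd = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_tesseract(...) to actually execute it.
    print(" ".join(build_tesseract_command("myscan.png", "out", ["eng", "jpn"])))
```

Building the argument list separately makes it easy to check the exact -l value before running anything.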


tobltobs



But what's the effect on precision? Is there a risk of getting some English words wrong which would be recognised if I didn't specify another language? What if I have no idea about the language of the document and select ten languages? Does tesseract just try all languages on the whole text and then keep whichever words seem more likely to be correct based on each language's dictionary? – Nemo Aug 22 '18 at 06:41

score 3 · Answer 2 · answered Oct 15 '20 at 07:34


Try this:

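import pytesseract
from langdetect import detect_langs
from PIL import Image

img = Image.open('image.png')  # placeholder path; use your own image
custom_config = r'-l eng+jpn --psm 6'  # both languages; PSM 6 = single uniform block of text
txt = pytesseract.image_to_string(img, config=custom_config)

# Optional: estimate which languages appear in the recognized text
detect_langs(txt)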

Note: you have to install langdetect by using:

 pip install langdetect


rahul

