30

I have to analyze an image which contains both English and Japanese text. When I run tesseract with the default language (-l eng), some Japanese characters are lost. If I instead run tesseract with Japanese (-l jpn), some English characters are lost (e.g. "Email").

How can I run a single process that recognizes both English and Japanese characters?

Martin Thoma
pars

2 Answers

53

Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter.

-l lang The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes.

An example:

tesseract myscan.png out -l deu+eng
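For the English and Japanese case in the question, the same flag would presumably be `-l eng+jpn`, provided the `jpn` language data is installed. Below is a minimal sketch of driving the CLI from Python. The helper `build_tesseract_cmd` and the file names `scan.png` and `out` are hypothetical and not part of tesseract itself.

```python
import os
import shutil
import subprocess


def build_tesseract_cmd(image_path, out_base, langs):
    """Return the argv list for a multi-language tesseract run."""
    # Tesseract expects the languages joined by '+', e.g. "eng+jpn".
    return ["tesseract", image_path, out_base, "-l", "+".join(langs)]


if __name__ == "__main__":
    image = "scan.png"  # placeholder input path (assumption)
    cmd = build_tesseract_cmd(image, "out", ["eng", "jpn"])
    print(" ".join(cmd))
    # Only run OCR when the binary and the input image are actually present.
    if shutil.which("tesseract") and os.path.exists(image):
        subprocess.run(cmd, check=True)  # tesseract writes the text to out.txt
```

Passing the command as a list avoids shell quoting problems with unusual file names.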
tobltobs
  • 7
    But what's the effect on precision? Is there a risk of getting some English words wrong which would be recognised if I didn't specify another language? What if I have no idea about the language of the document and select ten languages? Does tesseract just try all languages on the whole text and then keep whichever words seem more likely to be correct based on each language's dictionary? – Nemo Aug 22 '18 at 06:41
3

Try this:

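import pytesseract
from PIL import Image
from langdetect import detect_langs

img = Image.open('image.png')  # placeholder path; load your own image here
custom_config = r'-l eng+jpn --psm 6'  # both languages; PSM 6 assumes a single uniform block of text
txt = pytesseract.image_to_string(img, config=custom_config)
print(detect_langs(txt))  # languages langdetect finds in the OCR output, with probabilities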

Note: you have to install langdetect by using:

 pip install langdetect
rahul