How to configure pytesseract to support text detection for non-English languages in Windows 10?

Question

I have tried pytesseract for English. It works fine and generates the expected result. But for languages other than English (e.g. Arabic), it fails and gives the following error:

TesseractError: (1, 'Error opening data file C:\\Program Files (x86)\\Tesseract-OCR\\ara.traineddata 
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. 
Failed loading language \'ara\' Tesseract couldn\'t load any languages! 
Could not initialize tesseract.')

I tried to get the file (ara.traineddata) from GitHub, but couldn't get it to work.

I found the page [Traineddata Files for Version 4.00 +](https://tesseract-ocr.github.io/tessdoc/Data-Files.html) and it has a link to `ara.traineddata` which I can download. I didn't test this file with tesseract — furas, Jan 05 '21 at 05:09
After downloading, I can run `tesseract image.png output-file --tessdata-dir folder/with/files/tessdata/ -l ara` — furas, Jan 05 '21 at 05:17
I followed https://developpaper.com/win10-installs-tesserocr-to-configure-python-to-recognize-alphanumeric-captcha-with-tesserocr/ and installed `testerocr` and `tesserocr`. Now pytesseract doesn't raise any error, but it also doesn't generate any output. It just clears the screen of the Python console. Only tesserocr, not pytesseract, could do the job. — Nadeem Anwar, Jan 06 '21 at 03:44
`pytesseract` simply executes a command like `tesseract image.png output-file ...`, so it can also take arguments like `--tessdata-dir`, probably as a dictionary with extra options — furas, Jan 06 '21 at 04:02

furas · Answer 1 · 2021-01-06T07:20:52.960

pytesseract is only a wrapper around the tesseract program (an OCR engine developed by Google).

tesseract needs language data files, which you can find in its documentation: Data Files.

You can download ara.traineddata to some folder and run tesseract with the option --tessdata-dir some_folder. It will then use ara.traineddata from this folder.

If you save ara.traineddata in the same folder where you run the code, you can use . (dot)

tesseract image.jpg stdout -l ara --tessdata-dir .

You can do the same in pytesseract using config=

import pytesseract

text = pytesseract.image_to_string('image.jpg', lang='ara', config='--tessdata-dir .')

print(text)
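
Since the original error was about a missing ara.traineddata, it may help to confirm the file is really in the folder before passing it to tesseract. The following is only a minimal sketch: `tessdata_config` is a hypothetical helper name used for illustration, not part of pytesseract, and it does not handle folder paths that contain spaces.

```python
from pathlib import Path


def tessdata_config(tessdata_dir, lang):
    """Build a --tessdata-dir option string for the given folder.

    Hypothetical helper for illustration; not part of pytesseract.
    Raises FileNotFoundError if <lang>.traineddata is not in the folder.
    """
    folder = Path(tessdata_dir).resolve()
    data_file = folder / f"{lang}.traineddata"
    if not data_file.is_file():
        raise FileNotFoundError(f"{data_file} not found")
    # Paths containing spaces are not quoted here and may need extra care.
    return f"--tessdata-dir {folder}"
```

It could then be used as `pytesseract.image_to_string('image.jpg', lang='ara', config=tessdata_config('.', 'ara'))`.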

Alternatively, you can use the environment variable TESSDATA_PREFIX for this

import pytesseract
import os

os.environ['TESSDATA_PREFIX'] = '.'

text = pytesseract.image_to_string('text-ara.jpg', lang='ara')

print(text)

Later you can set TESSDATA_PREFIX directly in the system, or you can try moving ara.traineddata to the folder that holds the other .traineddata files. There should be an eng.traineddata somewhere, which you can try to locate with a program or command such as find
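
If no search tool is at hand, the same lookup can be sketched in Python. This is only an illustration: `find_tessdata_dirs` is a hypothetical helper name, and the folder to search is an assumption that depends on where tesseract was installed.

```python
import os


def find_tessdata_dirs(root, marker="eng.traineddata"):
    """Yield every directory under root that contains the marker file.

    Hypothetical helper for illustration; the root to search is up to you.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        if marker in filenames:
            yield dirpath
```

For example, `list(find_tessdata_dirs(r"C:\Program Files (x86)\Tesseract-OCR"))` would search the install folder named in the error message from the question. Any folder it returns is a candidate place to copy ara.traineddata into.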

I tested it with this image, which I also found in the documentation: Command Line Usage

BTW: tesseract normally saves the text to a file, but if you use stdout then it displays the text in the console.
