
I am working with the Python pytesseract package, using sample code like the following:

import pytesseract
from PIL import Image

tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)

And I received the following error message:

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

From my understanding, the error occurs when reading the file chi-sim.traineddata (the Simplified Chinese model). Below are the attempts I have made to solve this problem.

  • My development environment is macOS on an M1 Mac, and I installed tesseract and tesseract-lang with Homebrew. I am fairly sure that the path specified above is exactly where the data files are located, since when I call
print(pytesseract.get_languages(config = ""))

I get a long list of languages printed, including chi-sim.

  • Further, if we just use English instead of Chinese, the following code can successfully recognize the English texts in an image:
text = pytesseract.image_to_string(image)
  • I've tried to specify the TESSDATA_PREFIX environment variable in multiple ways, including:
  1. Using the config parameter as in the original code.

  2. Adding a global environment variable in PyCharm.

  3. Adding the following line to the code:

os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
  4. Adding the following line to bash_profile in the terminal:
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/

Unfortunately, none of these worked.

  • It seemed as if my chi-sim.traineddata file was somehow broken, so I downloaded the trained data file directly from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata) using the "Download" button on the right, and placed it in both the tesseract-lang directory and the original tesseract directory (where eng.traineddata is located). I tried each location, but neither worked.

Are there any potential solutions to this issue?

Anemonee
  • If you are on Windows, did you set up the PATH environment variable for tesseract? – seraph Jul 17 '21 at 13:07
  • Never mind, just read that you are on Mac OS, so have you tried reinstalling the whole package? – seraph Jul 17 '21 at 13:08
  • If `get_languages(config = "")` shows `chi-sim`, then why do you set `tessdata-dir`? Did you try without changing `tessdata-dir`? – furas Jul 17 '21 at 13:11
  • If `chi-sim.traineddata` is broken, then you have to download it again. You don't need to change `tessdata-dir` (even the error shows that it is correct), but you have to get the correct file from the server. – furas Jul 17 '21 at 13:13
  • Also, what is your system language setting in macOS? There used to be some issues with tesseract on non-English system languages. – seraph Jul 17 '21 at 13:14
  • In the question (not in a comment) you could add a link to the GitHub page where you found `chi-sim.traineddata`, and you could describe how you downloaded it. Maybe you downloaded it the wrong way (i.e. in `text-mode` instead of `bytes-mode`), or maybe you got files for an older version. See GitHub with [tessdata for 4.x](https://github.com/tesseract-ocr/tessdata); there is a link to [tessdata for 3.x](https://github.com/tesseract-ocr/tessdata/tree/3.04.00). – furas Jul 17 '21 at 13:22
  • @seraph Yes, I've tried reinstalling pytesseract, tesseract-lang, and tesseract at the same time, but it did not work. – Anemonee Jul 17 '21 at 13:35
  • @furas Yes, the first line of code I ran was indeed without the "config = tessdata_dir" parameter, but it did not work, so I had to resort to specifying directories. – Anemonee Jul 17 '21 at 13:37
  • @seraph Hmmm... this is a good point, because the general language setting of my device is chi-sim. I will check it out later and update this post if anything useful comes of it. – Anemonee Jul 17 '21 at 13:39
  • @furas Sorry that I could not find the exact github link that I downloaded previously, I just tried the 4.x you provided but it does not work either. I've edited my question. But could you please explain a little bit on "text mode" and "bytes mode"? I was never aware of these. – Anemonee Jul 17 '21 at 13:53
  • I copied the original `eng.traineddata` and the `chi-sim.traineddata` file from the server to a new folder and tried `lang="chi-sim"` and `lang="eng"` with `config="--tessdata-dir path/to/new/folder"`. It works for `eng` but not for `chi-sim`, which may mean that `chi-sim.traineddata` is wrong. Maybe it is already broken on the server, or maybe it is a file for an older tesseract. I will check whether `eng.traineddata` from the server causes the same problem. It may be necessary to tell the authors that there is a problem. – furas Jul 17 '21 at 15:24
  • The code works for me on Linux if I use `chi_sim` with `_` instead of `-`, because the file downloaded from the server is named `chi_sim.traineddata`, also with `_`. – furas Jul 17 '21 at 15:33
  • @furas Interesting. Upon checking tessdata on GitHub, it does show the data packs with `_` instead of `-`. – seraph Jul 17 '21 at 17:26

1 Answer


The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with an underscore.


If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
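
To check which spelling your own tessdata folder expects, here is a minimal sketch that only looks at the file names and does not call tesseract at all. The helper names `available_languages` and `resolve_lang` are my own (hypothetical, not part of pytesseract), and it assumes the language code is simply the file name without the `.traineddata` extension:

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the language codes implied by the *.traineddata file names.

    Assumption: the code is the file name without the extension,
    e.g. chi_sim.traineddata -> "chi_sim".
    """
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def resolve_lang(tessdata_dir, lang):
    """Return a spelling of `lang` that matches a file in tessdata_dir.

    Tries the code as given first, then with "-" replaced by "_".
    Returns None if neither spelling has a matching file.
    """
    codes = set(available_languages(tessdata_dir))
    for candidate in (lang, lang.replace("-", "_")):
        if candidate in codes:
            return candidate
    return None
```

You could then pass the result of `resolve_lang(your_tessdata_dir, "chi-sim")` as the `lang` argument of `pytesseract.image_to_string`, provided it is not None.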

furas