How do I install a new language pack for Tesseract on Windows

Question

I have installed the pytesseract module in my venv and want to extract text from a German file

by executing this script from pytesseract and setting the language to German:

import cv2

import pytesseract


try:
    from PIL import Image
except ImportError:
    import Image

print(pytesseract.image_to_string(Image.open('test.jpg')))

print(pytesseract.image_to_string(Image.open('test.jpg'), lang='ger'))

which gives me

raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
 Error opening data file C:\\Program Files (x86)\\Tesseract-OCR/tessdata/ger.traineddata
 Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'ger\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

I have found the language data on [tessdoc/Data-Files](https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md).

So far I have only found a guide for Linux: "How do I install a new language pack for Tesseract on 16.04".
Where do I need to move the language files in my pytesseract site-packages to get the script working?

score 3 · Answer 1 · answered Aug 16 '21 at 08:49

There are two ways.

1. Install the corresponding tesseract package for your language:

apt-get install tesseract-ocr-YOUR_LANG_CODE

For example, in my case it was Bengali, so I installed:

apt-get install tesseract-ocr-ben

Or, to install all languages:

apt-get install tesseract-ocr-all

This worked for me in an Ubuntu environment.

2. The other way is mentioned in the error message itself. Add an environment variable `TESSDATA_PREFIX` that points to the language pack. You can download the language pack from here: `https://github.com/tesseract-ocr/tessdata`.

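Once you have downloaded the data pack, you can also set the environment variable programmatically:

import os
# Assign via os.environ rather than os.putenv: putenv() does not update
# os.environ, so code that reads or forwards os.environ would miss the change.
os.environ['TESSDATA_PREFIX'] = 'path/to/your/tessdata/file'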
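
As a slightly fuller sketch of the second option, the snippet below sets the variable before any OCR call is made. The helper names `use_tessdata` and `ocr_image` are hypothetical and only for illustration; `pytesseract.image_to_string` and its `lang` argument are the same call used in the question. Which directory level `TESSDATA_PREFIX` has to name may depend on your Tesseract version (the error message in the question asks for the parent directory of `tessdata`), so treat the path you pass in as something to verify locally.

```python
import os
from pathlib import Path


def use_tessdata(tessdata_prefix):
    """Set TESSDATA_PREFIX for this process (hypothetical helper)."""
    path = Path(tessdata_prefix)
    # Assigning to os.environ updates both the real process environment
    # and the mapping Python hands to child processes.
    os.environ["TESSDATA_PREFIX"] = str(path)
    return path


def ocr_image(image_path, lang):
    """Run OCR on one image with the given language code (hypothetical helper)."""
    # Imported lazily so use_tessdata() works even without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Call `use_tessdata(...)` once at start-up, before the first `ocr_image(...)` call, so that the Tesseract process started by pytesseract sees the variable.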

score 1 · Answer 2 · edited Nov 14 '21 at 01:07

Best way I've found:

1. Download and install tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe.
2. Open https://github.com/tesseract-ocr/tessdata and download your language. For example, for Farsi download fas.traineddata.
3. Copy the downloaded file to the tessdata folder of the Tesseract-OCR installation, typically somewhere like: C:\Program Files\Tesseract-OCR\tessdata
4. Don't forget to use the traineddata name for the language. For Farsi, I use lang='fas'.
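
To check that the copy worked and that the `lang` value matches a file name, a small sketch like the following can help. The function names are hypothetical, and the only assumption is the convention described above: a language `xyz` is available when `xyz.traineddata` sits in the tessdata folder. Note that the German model on the tessdata repository is `deu.traineddata`, so the matching value would be `lang='deu'`, not `'ger'`.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """List the codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in tessdata_dir."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

For example, `has_language(r"C:\Program Files\Tesseract-OCR\tessdata", "deu")` should return `True` once `deu.traineddata` has been copied there; adjust the path to your own installation.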

score 0 · Accepted Answer · answered Jul 23 '20 at 07:51


I found a guide for this on a German site: "Python Texterkennung: Bild zu Text mit PyTesseract in Windows" (Python text recognition: image to text with PyTesseract on Windows).


Sator

