Add four additional special unicode characters to tesseract

Question

I have a document regarding the transliteration of Egyptian hieroglyphs. I'm not interested now in OCR'ing the hieroglyphs but the transliteration uses 5 special characters which do not exist in English. I should not have to read a whole book in order to find out how to add these five characters to the set of characters that Tesseract can read.

I will list just one of the characters as an example: code point 7717 in decimal, which in Python is chr(7717). Once I figure out how to get Tesseract to read that one, it should be simple to add the others. Does anyone know how to add this character to the set of characters Tesseract can read?
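For reference, the character behind that code point can be inspected with Python's standard library alone. This is only a small sketch, and the variable name `ch` is chosen for illustration:

```python
import unicodedata

ch = chr(7717)                 # decimal code point from the question
print(ch)                      # the character itself: ḥ
print(f"U+{ord(ch):04X}")      # hexadecimal form: U+1E25
print(unicodedata.name(ch))    # LATIN SMALL LETTER H WITH DOT BELOW
```

Knowing the hexadecimal form helps when searching Tesseract's character set files or issue tracker, which usually refer to characters by their U+ notation.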

As a side note, I could find zero books written on how to use Tesseract specifically for reading PDF texts. I found a lot of books on computer vision and a few websites, but I hate websites because they never go into sufficient detail. So if anyone knows of any good books which explain how to use Python Tesseract I would appreciate it.

Also, I did try reading the Tesseract official documentation. Roughly 95% of all official documentation is bad and assumes that you already understand how to use the software but Tesseract's documentation stood out from the crowd in being particularly bad.

UPDATE

Ok, I did some more research and it seems that I have to pass an option through the config argument:

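try:
    from PIL import Image
except ImportError:
    import Image  # fallback for the old standalone PIL package
import pytesseract

str3 = 'beylage.jpg'

# Ask Tesseract to restrict its output to the listed characters plus ḥ
str4 = pytesseract.image_to_string(Image.open(str3),
    config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,;-(){}[]ḥ')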

Although I did not receive any error message, nothing changed in my output. It also output characters that I did not specify, such as ? and #.

UPDATE

I found out that they removed the whitelist feature in Tesseract 4.0. Worst decision they ever made. A $100 bounty for solving this problem has been open for a year and no one has claimed it. https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-unsupported-with-lstm-4-0
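If the whitelist is not honored, one possible workaround is to filter the recognized text afterwards in Python. The following is a minimal sketch, not a fix for Tesseract itself: `ALLOWED` and `filter_to_whitelist` are hypothetical names, and the filter can only drop unwanted characters such as ? and #. It cannot make Tesseract recognize a character that the loaded model does not know.

```python
import string

# Characters to keep: the whitelist from the question, plus space and newline
# so that word and line breaks survive the filtering.
ALLOWED = set(
    string.digits + string.ascii_letters + ".,;-(){}[]" + chr(7717) + " \n"
)

def filter_to_whitelist(text, allowed=ALLOWED):
    """Return text with every character outside the allowed set removed."""
    return "".join(c for c in text if c in allowed)

sample = chr(7717) + "tp? #1"          # stand-in for OCR output
print(filter_to_whitelist(sample))     # the ? and # are dropped
```

In practice the string returned by `pytesseract.image_to_string` would take the place of `sample`.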

In any case, it seems that the whitelist might still work with the legacy engine, so I tried the following:

str4 = pytesseract.image_to_string(Image.open(str3),
    config='--oem 0 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzḥś')

But now it says that it failed to load languages. So I'm working on that problem now.

score 0 · Answer 1 · answered Jul 15 '19 at 03:45

Ok, I was able to get rid of the latest error by downloading eng.traineddata from here:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

It then took a lot of work to find out where to put that file on a Mac, but I found the answer here:

Where is the default tesseract installation folder on a mac?
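To check whether the file ended up somewhere Tesseract can find it, a small helper like the one below may save some guessing. It is a sketch under assumptions: `find_traineddata` and `build_config` are hypothetical helpers, the directories to search have to be supplied by you, and only the `TESSDATA_PREFIX` environment variable and the `--tessdata-dir`, `--oem` and `-c` options come from Tesseract itself.

```python
import os
from pathlib import Path

def find_traineddata(lang, extra_dirs=()):
    """Return the first existing <lang>.traineddata path, or None."""
    candidates = []
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        # Depending on the Tesseract version, the prefix points either at
        # the tessdata directory itself or at its parent, so try both.
        candidates += [Path(prefix), Path(prefix) / "tessdata"]
    candidates += [Path(d) for d in extra_dirs]
    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None

def build_config(tessdata_dir, whitelist):
    """Assemble a config string that names the data directory explicitly."""
    return (
        f'--tessdata-dir "{tessdata_dir}" --oem 0 '
        f"-c tessedit_char_whitelist={whitelist}"
    )
```

If `find_traineddata("eng", [some_dir])` returns a path, passing `build_config(path.parent, whitelist)` as the `config` argument of `pytesseract.image_to_string` tells Tesseract where to look instead of relying on its default location.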

However, since I was now using the older legacy engine, accuracy decreased seriously, almost to the point of illegibility. Currently, there is no solution to this problem.
