How can I add a new font to Tesseract 4.0?

Question

I'm making a text identification program and I want to train my Tesseract 4.0 to identify a specific font (in Hebrew). How can I do it?

I tried "trainyourtesseract.com" (which didn't work at all) and "jTessBoxEditor" (which I couldn't figure out how to use properly).

I would love to get some help with that issue. Thanks.

score 2 · Answer 1 · edited Apr 25 '20 at 07:27

Did you try reading this link? https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining The rough approach is to prepare your own language files (most importantly your own .training_text file), then run tesstrain.sh to generate the dataset. After that, run combine_tessdata to extract the .lstm file from the original Hebrew model and pass it to the lstmtraining tool to fine-tune the original model on your new font.

UPDATE: the documentation link has changed: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00
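
The steps above can be sketched roughly as follows. This is only an outline based on the linked tutorial, not a tested recipe: the font name, every directory, and the iteration count are placeholders I made up, and the run helper is just a dry-run wrapper that prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: fine-tune the stock Hebrew LSTM model on one additional font.
# Every path, the font name and the iteration count are assumptions.

LANG_CODE=heb
FONT_NAME="My Hebrew Font"     # hypothetical; use a name listed by: text2image --list_available_fonts
FONTS_DIR=/usr/share/fonts     # directory that contains the font file
LANGDATA_DIR=./langdata_lstm   # checkout holding heb/heb.training_text
TESSDATA_DIR=./tessdata_best   # float models; the integer "fast" models cannot be fine-tuned
WORK_DIR=./work

# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

finetune_font() {
    run mkdir -p "$WORK_DIR/train" "$WORK_DIR/out" || return 1

    # 1. Render the training text in the new font and write .lstmf files.
    run tesstrain.sh --lang "$LANG_CODE" --linedata_only --noextract_font_properties \
        --fonts_dir "$FONTS_DIR" --fontlist "$FONT_NAME" \
        --langdata_dir "$LANGDATA_DIR" --tessdata_dir "$TESSDATA_DIR" \
        --output_dir "$WORK_DIR/train" || return 1

    # 2. Extract the LSTM network from the original Hebrew model.
    run combine_tessdata -e "$TESSDATA_DIR/$LANG_CODE.traineddata" \
        "$WORK_DIR/$LANG_CODE.lstm" || return 1

    # 3. Continue training from that network on the new font's data.
    run lstmtraining --model_output "$WORK_DIR/out/font" \
        --continue_from "$WORK_DIR/$LANG_CODE.lstm" \
        --traineddata "$TESSDATA_DIR/$LANG_CODE.traineddata" \
        --train_listfile "$WORK_DIR/train/$LANG_CODE.training_files.txt" \
        --max_iterations 400 || return 1

    # 4. Convert the last checkpoint into a usable .traineddata file.
    run lstmtraining --stop_training \
        --continue_from "$WORK_DIR/out/font_checkpoint" \
        --traineddata "$TESSDATA_DIR/$LANG_CODE.traineddata" \
        --model_output "$WORK_DIR/out/$LANG_CODE.traineddata"
}

finetune_font
```

Run it once as is to review the printed commands, then with DRY_RUN=0 to actually execute them.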

Thusitha Deepal · Answer 2 · 2022-09-26T04:28:59.597

For a detailed walkthrough, watch this video: https://www.youtube.com/watch?v=N5Y6gZgvryQ

Here is the shell script for custom Tesseract training:

N=3 # number of images

# Image name format: languagename.fontname.expN.filetype

# Step 1: Make the box files

for i in $(seq 1 "$N")
do
    tesseract "testlan.arial.exp$i.png" "testlan.arial.exp$i" batch.nochop makebox
done

# After manually correcting the box files, continue with the following steps.

# Step 2: Create a .tr file for each image (combines the image file and its box file)

for i in $(seq 1 "$N")
do
    tesseract "testlan.arial.exp$i.png" "testlan.arial.exp$i" box.train
done

# Step 3: Extract the character set from the box files (this writes the unicharset file).
# Pass all box files in one call; running it once per file overwrites unicharset each time.

unicharset_extractor testlan.arial.exp*.box

# Step 4: Create a font_properties file based on our needs. Each line has the format:
# [fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]
# The font name must match the fontname part of the image file names.

echo "arial 0 0 0 0 0" > font_properties

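# Step 5: Train the shape features. Pass all .tr files in one call;
# running mftraining once per file overwrites its output each time.

mftraining -F font_properties -U unicharset -O testlan.unicharset testlan.arial.exp*.tr

# Step 6: Train the character normalization data, again over all .tr files at once.

cntraining testlan.arial.exp*.tr

# After steps 5 and 6 the files shapetable, inttemp, pffmtable and normproto have been created.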

# Step 7: Rename the four files (shapetable, inttemp, pffmtable, normproto) to
# [langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto

mv inttemp testlan.inttemp
mv normproto testlan.normproto
mv pffmtable testlan.pffmtable
mv shapetable testlan.shapetable

# Step 8: Combine the files into testlan.traineddata

combine_tessdata testlan.

# Finally, move testlan.traineddata to C:\Program Files\Tesseract-OCR\tessdata
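
Once testlan.traineddata exists, a quick check along these lines may help. It is only a sketch: sample.png, sample_out and the tessdata directory are hypothetical names, and the run helper just prints each command unless DRY_RUN=0 is set. The files built above hold legacy-engine classifier data and no LSTM component, so the example selects the legacy engine explicitly with --oem 0.

```shell
#!/bin/sh
# Sketch: install the new traineddata file and try it on one image.
# TESSDATA_DIR, sample.png and sample_out are assumptions.

TESSDATA_DIR=${TESSDATA_DIR:-./tessdata}

# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

install_and_test() {
    run mkdir -p "$TESSDATA_DIR" || return 1
    run cp testlan.traineddata "$TESSDATA_DIR/" || return 1

    # List the components packed into the file (inttemp, pffmtable, normproto, ...).
    run combine_tessdata -d "$TESSDATA_DIR/testlan.traineddata" || return 1

    # Recognize one image with the new language, forcing the legacy engine.
    run tesseract sample.png sample_out --tessdata-dir "$TESSDATA_DIR" -l testlan --oem 0
}

install_and_test
```

With DRY_RUN=0 the recognized text ends up in sample_out.txt.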

watch this video: https://www.youtube.com/watch?v=N5Y6gZgvryQ — Thusitha Deepal, Sep 26 '22 at 04:28
Hi, I also asked this question on your video. It does not clearly show that we can add a new font to the trained data this way, because I do not see any reference to the new font data, and because in the step that creates font_properties we must include a font that is in Tesseract's supported font list. Am I pointing in the right direction? — dellos, Sep 26 '22 at 09:10
Thusitha Deepal, I got stuck after running mftraining: it only printed [Warning: No shape table file present: shapetable] and then hung forever. I used the UB Mannheim build v5.2.0.20220712. — dellos, Sep 26 '22 at 09:27