
I am very new to Tesseract. I am following this tutorial, which uses a bash script to train Tesseract, and I intend to create trained data for the BM Mini font. I have created a box file for my training image (I am only using one image so far). This is the code that makes the box file:

tesseract='C:/Program Files/Tesseract-OCR/tesseract.exe'
"$tesseract" eng.bmmini.exp0.png eng.bmmini.exp0 batch.nochop makebox

The box file was made and I corrected the values. I used qt-box-editor to do so.

I then run this part of the script:

tesseract='C:/Program Files/Tesseract-OCR/tesseract.exe'
font_properties eng.inttemp eng.normproto *.tr *.txt
"$tesseract" eng.bmmini.exp0.png eng.bmmini.exp0 nobatch box.train
unicharset_extractor eng.bmmini.exp0.box
echo “bmmini 0 0 0 0 0” > font_properties

The unicharset, font_properties, and .tr files are created. I can open them in Notepad++, but I do not have the experience to tell whether they contain any errors. I run these scripts from Git Bash, and this is the output:

$ bash train.sh
Tesseract Open Source OCR Engine v5.0.0-alpha.20210811 with Leptonica
Estimating resolution as 158
APPLY_BOXES:
   Boxes read from boxfile:      62
   Found 62 good blobs.
   Leaving 1 unlabelled blobs in 0 words.
Generated training data for 3 words
Extracting unicharset from box file eng.bmmini.exp0.box
Wrote unicharset file unicharset
(base)

No errors so far. However, I then add this line to my script:

mftraining –F font_properties –U unicharset –O eng.unicharset eng.bmmini.exp0.tr

Then I get these errors:

Bad box coordinates in boxfile string!  &x
Bad format in tr file, reading box coords
Bad format in tr file, reading fontname, unichar
Bad format in tr file, reading fontname, unichar
Bad box coordinates in boxfile string!  &▒RH▒(▒▒
Bad format in tr file, reading box coords
Bad format in tr file, reading fontname, unichar
Bad format in tr file, reading fontname, unichar
Bad format in tr file, reading fontname, unichar
...

I have a hunch that something is wrong in my box file, but I do not know how to check that. I am also new to Stack Overflow, and I am not sure how I can share my files for help. Thank you for your patience!

1 Answer


This is an old, outdated tutorial written for Tesseract 3.x, but you are using it with Tesseract 5, which defaults to a completely different (LSTM-based) engine. Follow the official training process instead: https://github.com/tesseract-ocr/tesstrain
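As a rough sketch of what the tesstrain workflow looks like: it expects a folder of single-line images (`.png` or `.tif`), each paired with a `.gt.txt` file holding the transcription of that line, under `data/MODEL_NAME-ground-truth`. The model name `bmmini` and the helper `missing_transcriptions` below are assumptions made for illustration, and `START_MODEL=eng` assumes that `eng.traineddata` is available to tesstrain. Your single page image would first have to be cut into one image per text line.

```shell
#!/usr/bin/env bash
# Sketch only: check a tesstrain ground-truth folder, then print the training command.
set -euo pipefail

MODEL_NAME="${MODEL_NAME:-bmmini}"          # assumed model name, pick your own
GT_DIR="data/${MODEL_NAME}-ground-truth"    # tesstrain's default ground-truth folder

# Hypothetical helper: list every line image in a folder that has no
# matching .gt.txt transcription next to it.
missing_transcriptions() {
  local dir="$1" img base
  for img in "$dir"/*.png "$dir"/*.tif; do
    [ -e "$img" ] || continue               # glob matched nothing
    base="${img%.*}"                        # strip .png / .tif
    base="${base%.bin}"                     # line.bin.png pairs with line.gt.txt
    base="${base%.nrm}"                     # line.nrm.png pairs with line.gt.txt
    [ -f "${base}.gt.txt" ] || echo "$img"
  done
}

mkdir -p "$GT_DIR"
missing_transcriptions "$GT_DIR"

# Once nothing is reported, run this from the root of a tesstrain checkout:
echo "make training MODEL_NAME=${MODEL_NAME} START_MODEL=eng"
```

If the helper prints nothing, every image has a transcription and the printed `make` command can be tried. Leaving out `START_MODEL` trains from scratch, which generally needs far more data than a single page.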

user898678