I just started using Tesseract.

I am following the instructions described here.

I have created a test image like this:

training/text2image --text=test.txt --outputbase=eng.Arial.exp0 --font='Arial' --fonts_dir=/usr/share/fonts

Now I want to train Tesseract as follows:

tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train

Here is the output I get:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     112
   Found 112 good blobs.
Generated training data for 21 words
Warning in pixReadMemTiff: tiff page 1 not found

This prevents the fontfile.tr file from being created. I tried ignoring the warning and continuing, but when generating the character set I get awful content:

unicharset_extractor lang.fontname.exp0.box

"58
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # Broken
T 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # T [54 ]
h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # h [68 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
( 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # ( [28 ]
q 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # q [71 ]
u 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # u [75 ]
..."

Here is the version I am using:

tesseract 3.04.00
 leptonica-1.72
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Any idea why this happens?

arghtype

1 Answer

It may be a bug. With v4.00.00alpha I get:

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     100
   Found 100 good blobs.
Generated training data for 21 words
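
To see whether the warning really stopped the training file from being written, it may help to check for the file directly after the box.train step. As far as I understand the training steps, the result is written to <outputbase>.tr, so with the command in the question that would be eng.Arial.exp0.tr rather than fontfile.tr; treat that naming as an assumption and confirm it on your version. A minimal sketch, where check_tr_file is a hypothetical helper and not part of Tesseract:

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): report whether a training
# run left behind a non-empty <outputbase>.tr file.
# Assumption: box.train writes its output to "<outputbase>.tr".
check_tr_file() {
    base="$1"                       # output base, e.g. eng.Arial.exp0
    if [ -s "${base}.tr" ]; then    # -s: file exists and is not empty
        echo "ok: ${base}.tr"
        return 0
    fi
    echo "missing or empty: ${base}.tr"
    return 1
}
```

Run it right after the tesseract command, for example `check_tr_file eng.Arial.exp0`. If it reports "ok", the warning was probably harmless and the later steps can use that file.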
parkydr