
i am attempting to OCR a 3-digit counter in a video using tesseract 4.1.1 on Kubuntu 21.04 (full tesseract version string below). i am failing to get any characters into the shape table during the shapeclustering phase, and no other troubleshooting has worked for me -- i turn to you with humble heart. n.b.: the images are of a small pixel font, which takes up the entirety of my source image.

image preparation and collation

from the source videos, i: crop to only the counter, invert, grayscale, dump at 1 fps, and then upscale by 1000% to 780x180. the results are individual frames such as this. i take a run of sequential numbers counting down from 500 (without any duplicates or blank images) and combine them into a multi-page .tif. (i can't upload the file here, but find the set of images mosaic'd together here)

i import this file into jTessBoxEditor as, for example, type_3.font.exp0.tif. i run tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox to create a .box file, with understandably nonsensical results.

because the source frames are hand-chosen and the digit positions are consistent, i'm able to edit the .box file with known box sizes and positions, like so:

5 0 0 240 180 0
0 270 0 510 180 0
0 540 0 780 180 0
4 0 0 240 180 1
9 270 0 510 180 1
9 540 0 780 180 1
4 0 0 240 180 2
9 270 0 510 180 2
8 540 0 780 180 2
4 0 0 240 180 3
9 270 0 510 180 3
7 540 0 780 180 3
...
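
Since the box positions are fixed and the counter is sequential, a listing like the one above can be generated rather than edited by hand. The following is only a minimal sketch: the function name `box_lines` is hypothetical, and it assumes the counter drops by exactly 1 per page, that each page holds three digits at the x-ranges shown above, and that every box spans the full 180-pixel height.

```python
# Sketch: generate Tesseract box-file lines for a 3-digit countdown.
# Assumptions (not confirmed beyond the excerpt above): the counter
# decreases by exactly 1 per page, and digit boxes never move.

def box_lines(start, pages,
              x_spans=((0, 240), (270, 510), (540, 780)),
              height=180):
    """Return box lines in the form 'char left bottom right top page'."""
    lines = []
    for page in range(pages):
        digits = f"{start - page:03d}"  # e.g. 499 -> "499", 7 -> "007"
        for ch, (left, right) in zip(digits, x_spans):
            lines.append(f"{ch} {left} 0 {right} {height} {page}")
    return lines


# Print the first four pages, matching the listing above.
for line in box_lines(500, 4):
    print(line)
```

Calling `box_lines(500, 131)` would give one line per digit for all 131 pages; writing those lines to the .box file is left out of the sketch.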

i load the edited .box into jTessBoxEditor to check that it indeed matches my data. this is a 131-page .tif (393 boxes), meaning roughly 40 training samples per digit.

training steps (where the problems begin)

i create font_properties and load it with font 0 0 0 0 0. please note that i've also tried type_3 0 0 0 0 0 and type_3.font.exp0 0 0 0 0 0, with no difference in the results below.

i input tesseract type_3.font.exp0.tif type_3.font.exp0 nobatch box.train and a training file is created; however, each page is listed as blank (is this normal?). e.g.:

Page 108
Warning: Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 2263
Empty page!!
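
To see whether every page or only some pages are reported as empty, the captured box.train output can be scanned. This is a rough sketch only: `empty_pages` is a hypothetical helper, and it assumes the log has been saved as text and follows the "Page N" / "Empty page!!" layout shown above.

```python
import re

# Sketch: list the page numbers that tesseract reported as empty.
# Assumes each page's messages start with a "Page N" line, as in the
# excerpt above; other log layouts are not handled.

def empty_pages(log_text):
    empty = []
    current = None
    for line in log_text.splitlines():
        match = re.match(r"Page (\d+)", line.strip())
        if match:
            current = int(match.group(1))
        elif "Empty page!!" in line and current is not None:
            empty.append(current)
    return empty


sample = """Page 108
Warning: Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 2263
Empty page!!"""
print(empty_pages(sample))
```

If the returned list covers all 131 pages, the .tr file most likely contains no training samples at all, which would be consistent with the shapeclustering output further down.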

i input unicharset_extractor font_name.font.exp0.box with success -- the resulting extraction contains the characters i've identified, with some extra lines

13
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1    # Broken
5 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 5  # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0  # 0 [30 ]0
4 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 4  # 4 [34 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 9  # 9 [39 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 8  # 8 [38 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 8 2 8 7  # 7 [37 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 9 2 9 6  # 6 [36 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 10 2 10 3    # 3 [33 ]0
2 8 0,255,0,255,0,0,0,0,0,0 Common 11 2 11 2    # 2 [32 ]0
1 8 0,255,0,255,0,0,0,0,0,0 Common 12 2 12 1    # 1 [31 ]0
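
A quick way to confirm that the extraction holds exactly the ten digits is to read the first field of each entry. This is a simplified sketch, with `read_unichars` as a hypothetical helper; it assumes the first line is the entry count and that no unichar contains a space. The three entries ahead of the digits (NULL, Joined, |Broken|0|1) appear to be special entries added by unicharset_extractor rather than characters from the box file, though that is an assumption here.

```python
# Sketch: pull the declared count and the unichar column out of a
# unicharset file's text. Simplification: the unichar is taken to be
# everything before the first space on each line.

def read_unichars(text):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])                      # first line: entry count
    unichars = [ln.split(" ", 1)[0] for ln in lines[1:]]
    return declared, unichars


def digits_only(unichars):
    """Keep only single-digit entries, sorted for easy comparison."""
    return sorted(u for u in unichars if len(u) == 1 and u.isdigit())
```

For the file above, `read_unichars` should report 13 declared entries, and `digits_only` should return the digits 0 through 9.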

but i know that failure has come for me when shapeclustering -F font_properties -U unicharset -O type_3.unicharset type_3.font.exp0.tr results in

Reading type_3.font.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000

...

Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Master shape_table:Number of shapes = 0 max unichars = 0 number with multiple unichars = 0

it has not recognized any shapes at all.

my plea:

what have i missed?? what can i do to pass these 10 humble characters to tesseract?

full version string (installed via apt)

tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5