i am attempting to extract OCR data of a 3-digit counter within a video via tesseract 4.1.1 on Kubuntu 21.04. (full tesseract version string below.) i am failing to add characters during the shapetable phase, and no other troubleshooting has worked for me -- i turn to you with humble heart. n.b.: the images are of a small pixel font, which takes up the entirety of my source image
image preparation and collation
from the source videos, i: crop to only the counter, invert, grayscale, dump at 1 fps, and then increase resolution by 1000% to 780x180
resolution. the results are individual frames such as this. i take a section of sequential numbers counting down from 500 (without any duplicates or blank images) and combine them into a .tif
. (i can't upload the file here, but find the set of images mosaic'd together here)
i import this file into jTessBoxEditor as, for example, type_3.font.exp0.tif
. i run tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox
to create a .box file, with understandably nonsensical results.
with the hand-chosen source frames and the consistent positions, i'm able to edit the .box file with known box sizes, quantities, like so:
5 0 0 240 180 0
0 270 0 510 180 0
0 540 0 780 180 0
4 0 0 240 180 1
9 270 0 510 180 1
9 540 0 780 180 1
4 0 0 240 180 2
9 270 0 510 180 2
8 540 0 780 180 2
4 0 0 240 180 3
9 270 0 510 180 3
7 540 0 780 180 3
...
i load the edited .box into the jTessBoxEditor to check that it indeed matches my data. this is a 131-page .tif, meaning roughly 40 trains per digit.
training steps (where the problems begin)
i create font_properties
and load it with font 0 0 0 0 0
. Please note that i've also tried type_3 0 0 0 0 0
and type_3.font.exp0 0 0 0 0 0
, with no difference on the below results
i input tesseract type_3.font.exp0.tif type_3.font.exp0 nobatch box.train
and a training file is created; however, each page is listed as blank (is this normal?). e.g.:
Page 108
Warning: Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 2263
Empty page!!
i input unicharset_extractor font_name.font.exp0.box
with success -- the resulting extraction contains the characters i've identified, with some extra lines
13
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
5 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0 # 0 [30 ]0
4 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 4 # 4 [34 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 9 # 9 [39 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 8 # 8 [38 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 8 2 8 7 # 7 [37 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 9 2 9 6 # 6 [36 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 10 2 10 3 # 3 [33 ]0
2 8 0,255,0,255,0,0,0,0,0,0 Common 11 2 11 2 # 2 [32 ]0
1 8 0,255,0,255,0,0,0,0,0,0 Common 12 2 12 1 # 1 [31 ]0
but i know that failure has come for me when shapeclustering -F font_properties -U unicharset -O type_3.unicharset type_3.font.exp0.tr
results in
Reading type_3.font.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
...
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Master shape_table:Number of shapes = 0 max unichars = 0 number with multiple unichars = 0
It has not recognized any shapes at all.
my plea:
what have i missed?? what can i do to pass these 10 humble characters to tesseract?
full version string (installed via apt)
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5