Tesseract OCR with numeric tables

Question

I need to OCR old statistical tables that contain numerical values for each town in a given area. I use Tesseract 4.0.0-beta.3, and in most cases I get acceptable results, but in some others the software fails to recognise the structure of the table and skips rows or entire columns.

I tried to find a more suitable configuration by checking --help-psm, but honestly I couldn't figure out which page segmentation mode would improve my results. I also tried slicing the tables into individual columns, but the results were even worse. I suppose the issue is that some cells contain only 1- or 2-digit numbers, so the rows are deemed too short to be text; that is usually helpful, but here it is a problem. What settings would you use to optimise the results?

score 0 · Answer 1 · edited Nov 06 '19 at 08:34

In a similar situation I used:

tesseract image test --psm 6 --oem 0 digits

I also cropped out the text on the left so that it could be processed separately.
Number recognition was OK, but I have about 10 columns, some of which are blank in some rows, and Tesseract unpredictably either ignores the vertical lines or reads them as "1".
I tried several settings and even removed the vertical lines, but couldn't get Tesseract to preserve the table structure for subsequent machine reading.
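One possible workaround for the lost table structure, which was not tested in the runs described above, is to let Tesseract report word positions and rebuild the grid afterwards. Appending the stock `tsv` config to the command (for example `tesseract image test --psm 6 tsv`) writes one line per word with `left`, `top`, `width` and `height` columns. The sketch below assumes the vertical rules have been removed from the image and that the x positions of the column boundaries are known, for instance measured by hand; `column_edges` and `row_tolerance` are hypothetical parameters, not Tesseract settings.

```python
import csv
import io
from bisect import bisect_right


def parse_tsv_words(tsv_text):
    """Return (x_center, top, text) for every non-empty word in Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe pages, blocks, lines.
        if row.get("level") == "5" and text:
            x_center = int(row["left"]) + int(row["width"]) / 2
            words.append((x_center, int(row["top"]), text))
    return words


def words_to_table(words, column_edges, row_tolerance=10):
    """Place words into a grid; cells with no word stay as empty strings.

    column_edges: sorted x positions separating the columns (assumed known).
    row_tolerance: maximum difference in `top`, in pixels, within one row (assumed).
    """
    n_cols = len(column_edges) + 1
    rows = []  # list of (top of first word in the row, list of cells)
    for x_center, top, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            cells = rows[-1][1]
        else:
            cells = [""] * n_cols
            rows.append((top, cells))
        col = bisect_right(column_edges, x_center)
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]
```

Because each word is assigned to a column by its own position, a blank cell simply stays empty instead of shifting the following numbers to the left. The row tolerance will likely need tuning to the scan resolution.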

Hope it helps.
