
Given end-to-end consumer OCR tools like Tesseract that output positions and confidence scores of detected words/lines on a page, how can we robustly measure "accuracy" and consensus between different results when page layouts are complex?


Previously considered approaches:

I've seen discussion of approaches that ignore layout and measure plain-text edit distance (e.g. on SO here and here)... but this seems fragile for documents with complex layouts like forms, which might yield quite different plain-text reading orders depending on the OCR engine used, and even on the configuration of a specific engine.
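For concreteness, this is roughly the layout-agnostic baseline I mean: concatenate each result into plain text and compute a normalized edit distance. A minimal sketch (plain dynamic-programming Levenshtein for illustration; any edit-distance library would do the same job):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert/delete/substitute, two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length; layout and reading order ignored."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

The problem is that "reference" and "hypothesis" here are just whatever reading order each engine happens to emit, so two essentially correct results for a multi-column form can still score badly against each other.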

There are also good arguments for use-case-oriented metrics: for example, training a layout+language model to extract the particular fields you're interested in and measuring edit distance only on those (the hope being that extracting specific expected fields sidesteps the layout/reading-order issues, e.g. by having just one field of each type per document; see the sketch below). However, this requires data and training, and introduces another component that can itself contribute errors. Zero-shot or "off-the-shelf" methods like generic key-value extraction or question-answering models might help, but then the user has to figure out how to map detected keys to a common ontology, or which questions to ask to measure accuracy. Finally, the more narrowly/precisely we define the business use case, the less trust these approaches can build that the overall foundational accuracy is "good enough".
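For the use-case-oriented variant, I'm picturing something like the sketch below: assume some extractor (trained or zero-shot; the field names and example values are purely hypothetical) returns a dict of field name → text, and we score only those fields, reusing the `levenshtein` helper from the previous snippet:

```python
from typing import Dict

def field_level_scores(extracted: Dict[str, str],
                       expected: Dict[str, str]) -> Dict[str, float]:
    """Normalized edit distance per expected field
    (0.0 = exact match, 1.0 = completely wrong; missing fields count as wrong)."""
    scores = {}
    for name, truth in expected.items():
        hyp = extracted.get(name, "")
        scores[name] = levenshtein(truth, hyp) / max(len(truth), 1)
    return scores

# Hypothetical example for an invoice-like document:
expected = {"invoice_date": "2000-01-01", "total": "123.45"}
extracted = {"invoice_date": "2000-01-0l", "total": "123.45"}  # OCR confused 1/l
print(field_level_scores(extracted, expected))
# {'invoice_date': 0.1, 'total': 0.0}
```

This sidesteps reading order, but only measures the fields I thought to define, which is exactly the trust problem described above.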

Some folks discuss using OCR confidence as a proxy for accuracy, which seems risky: not only might different tools be calibrated differently (or badly), but different configurations of the same tool might affect calibration too.

Since the goal is to compare end-to-end tools (including black-box options like cloud services), we can't really dive into the separate sub-components that usually make up an OCR system, like quality of text detection, line-wise text recognition, etc.


Ideas and challenges:

It feels like it should be possible to define some kind of layout-aware distance metric between alternative results, similar to how Intersection-over-Union is used in object detection to compare and consolidate detections...
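To make that concrete, the kind of thing I'm imagining (a very rough sketch: greedy matching rather than anything optimal, and the `(text, (x0, y0, x1, y1))` word format and 0.5 threshold are arbitrary assumptions, since real engines emit Tesseract TSV, hOCR, cloud JSON, etc.) would pair up word boxes from two results by IoU and then compare the paired texts:

```python
from typing import List, Tuple

# A detected word: (text, (x0, y0, x1, y1)) -- format is an assumption.
Word = Tuple[str, Tuple[float, float, float, float]]

def iou(a, b) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def match_words(result_a: List[Word], result_b: List[Word],
                iou_threshold: float = 0.5):
    """Greedily pair words across two OCR results by box overlap.
    Returns (pairs, unmatched_a, unmatched_b)."""
    candidates = sorted(
        ((iou(wa[1], wb[1]), i, j)
         for i, wa in enumerate(result_a)
         for j, wb in enumerate(result_b)),
        reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < iou_threshold:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i); used_b.add(j)
        pairs.append((result_a[i], result_b[j], score))
    unmatched_a = [w for i, w in enumerate(result_a) if i not in used_a]
    unmatched_b = [w for j, w in enumerate(result_b) if j not in used_b]
    return pairs, unmatched_a, unmatched_b
```

Matched pairs could then be compared with per-word edit distance and unmatched boxes penalized, but this one-to-one matching is exactly where the segmentation problem below bites.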

But I'm not really sure how to deal with the challenge that different tools/configurations might segment words or lines differently: most tools don't output character-level bounding boxes, which makes it tough to reconcile e.g. [2000-01-01] with [2000, -01, -01] or other permutations.
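One direction I've considered (again a rough sketch with an arbitrary coverage threshold, not a claim that it's robust; it reuses `Word` and `List` from the sketch above) is many-to-one grouping: for each word box in result A, gather all of result B's boxes that fall mostly inside it, concatenate their text in x-order, and compare that:

```python
def coverage(inner, outer) -> float:
    """Fraction of `inner` box area covered by `outer`."""
    ix0, iy0 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix1, iy1 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0

def group_by_container(result_a: List[Word], result_b: List[Word],
                       min_coverage: float = 0.7):
    """For each word in A, concatenate the B-words whose boxes lie mostly
    inside it, so [2000-01-01] can be compared against [2000][-01][-01]."""
    groups = []
    for text_a, box_a in result_a:
        contained = [(text_b, box_b) for text_b, box_b in result_b
                     if coverage(box_b, box_a) >= min_coverage]
        contained.sort(key=lambda w: w[1][0])   # left-to-right order
        groups.append((text_a, "".join(t for t, _ in contained)))
    return groups
```

This only handles the case where A's segmentation is the coarser one; presumably it would need to be run in both directions, plus something smarter for splits that don't nest cleanly, which is part of what I'm asking about.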

Some kind of global metric of "agreement" for a page would be useful for measuring against a ground truth... but a more local ability to synthesize multiple alternative results into a single consolidated OCR output would also be great for situations where we don't yet know what the ground truth should be.
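For the global side, the naive aggregate I can picture (purely a sketch building on the matching above, with an arbitrary choice to weight by box area and to count unmatched words as zero agreement; reuses `match_words`, `levenshtein`, `Word`) would be something like:

```python
def area(box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def page_agreement(result_a: List[Word], result_b: List[Word]) -> float:
    """Area-weighted text agreement between two OCR results for one page:
    1.0 = identical text in identical places, 0.0 = no overlap at all."""
    pairs, unmatched_a, unmatched_b = match_words(result_a, result_b)
    total, agree = 0.0, 0.0
    for (text_a, box_a), (text_b, box_b), _ in pairs:
        w = area(box_a) + area(box_b)
        sim = 1.0 - levenshtein(text_a, text_b) / max(len(text_a), len(text_b), 1)
        total += w
        agree += w * sim
    for _, box in unmatched_a + unmatched_b:   # unmatched words drag the score down
        total += area(box)
    return agree / total if total > 0 else 1.0
```

In principle a consolidated output could then pick the majority text per matched group across three or more engines, but the segmentation issue above makes that step non-obvious to me.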

