
I'm testing various Python image pre-processing pipelines for tesseract-ocr.

My input data are PDF invoices and receipts of all manner of quality, from scanned documents (best) to photos taken on mobile phones in poor lighting (worst), and everything in between. When scanning manually for OCR, I typically choose among several scanning presets (unsharp mask, edge fill, color enhance, gamma). I'm thinking about implementing a similar solution in a Python pipeline.
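For instance, here is a minimal sketch of two of those presets as Python functions, assuming OpenCV and NumPy (the file name is just a placeholder):

```python
# A minimal sketch of two "scanning presets" as Python functions,
# assuming OpenCV (cv2) and NumPy; function names and settings are my own.
import cv2
import numpy as np

def unsharp_mask(gray, sigma=3.0, amount=0.5):
    """Sharpen by subtracting a Gaussian-blurred copy from the original."""
    blurred = cv2.GaussianBlur(gray, (0, 0), sigma)
    return cv2.addWeighted(gray, 1.0 + amount, blurred, -amount, 0)

def gamma_correct(gray, gamma=1.5):
    """Brighten or darken mid-tones with a lookup table."""
    table = (np.linspace(0, 1, 256) ** (1.0 / gamma) * 255).astype("uint8")
    return cv2.LUT(gray, table)

gray = cv2.imread("invoice_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
processed = gamma_correct(unsharp_mask(gray))
cv2.imwrite("invoice_page_preprocessed.png", processed)
```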

I understand the standard metric for OCR quality is Levenshtein (edit) distance, which measures the quality of the results against a ground-truth transcription.
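For reference, this is the small self-contained edit-distance implementation I'd use for scoring (the sample strings are made up):

```python
# A self-contained Levenshtein distance, used to score OCR output
# against a known ground-truth transcription; sample strings are invented.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

ground_truth = "Invoice No. 12345  Total: 99.50 EUR"
ocr_output   = "Inv0ice No. 12845  Tota1: 99.50 EUR"
print(levenshtein(ocr_output, ground_truth))  # 3
```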

What I'm after are measurements of image-processing effects on OCR result quality. For example, in the paper _Prediction of OCR Accuracy_ the author describes at least two measurements: the White Speckle Factor (WSF) and the Broken Character Factor (BCF). Other descriptors I've read about include salt-and-pepper noise and aberrant pixels.
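I'm not sure of the paper's exact definitions, but a rough proxy I could compute, assuming OpenCV, would count tiny connected components after binarisation (the area threshold here is a guess):

```python
# A rough proxy for "speckle" symptoms, assuming OpenCV; this is my own
# approximation, not the paper's exact WSF/BCF definition.
import cv2

def speckle_stats(gray, max_speckle_area=4):
    # Otsu binarisation; text assumed dark on light, so invert to make ink white
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    areas = stats[1:, cv2.CC_STAT_AREA]        # skip the background label
    speckles = int((areas <= max_speckle_area).sum())
    return {"components": len(areas),
            "speckles": speckles,
            "speckle_ratio": speckles / max(len(areas), 1)}

gray = cv2.imread("receipt_photo.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
print(speckle_stats(gray))
```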

I've worked my way through 200 of the nearly 4k tesseract-tagged questions here. Very interesting. Most questions are of the type "I have this kind of image; how can I improve the OCR outcome?" So far I've found nothing about measuring the effect of image processing on OCR outcomes.

A curious question was this one, Dirty Image Quality Assesment Measure, but the question is not focused on OCR and the solutions seem overkill.

xtian

1 Answer


There is no universal image improvement technique for OCR-ability. Every image defect is (partly) corrected with ad-hoc techniques, and a technique that works in one case can be counter-productive in another.

For a homogeneous data set (in the sense that all documents have a similar origin/quality and were captured under the same conditions), you can indeed optimize the preprocessing chain by trying different combinations and settings and computing the total edit distance. But this requires prior knowledge of the ground truth (at least for a sample of the documents).
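A sketch of that search, assuming pytesseract, the python-Levenshtein package, and a handful of sample documents with known ground truth (file names, the candidate chain, and the parameter grid below are placeholders):

```python
# Grid-search preprocessing settings against total edit distance on a
# ground-truth sample; the chain and parameter values are illustrative only.
import itertools
import cv2
import numpy as np
import pytesseract
from Levenshtein import distance  # assumes the python-Levenshtein package

samples = {"doc1.png": "ground truth text of doc1",
           "doc2.png": "ground truth text of doc2"}

def preprocess(gray, sigma, gamma):
    # one candidate chain: mild unsharp mask followed by gamma correction
    blurred = cv2.GaussianBlur(gray, (0, 0), sigma)
    sharp = cv2.addWeighted(gray, 1.5, blurred, -0.5, 0)
    table = (np.linspace(0, 1, 256) ** (1.0 / gamma) * 255).astype("uint8")
    return cv2.LUT(sharp, table)

best = None
for sigma, gamma in itertools.product([1.0, 2.0, 3.0], [0.8, 1.0, 1.5]):
    total = 0
    for path, truth in samples.items():
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        text = pytesseract.image_to_string(preprocess(gray, sigma, gamma))
        total += distance(text, truth)
    if best is None or total < best[0]:
        best = (total, sigma, gamma)

print("lowest total edit distance, sigma, gamma:", best)
```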

But for heterogeneous data sets, there is little that you can do. There remains the option of testing different preprocessing chains and relying on the recognition scores returned by the OCR engine, assuming that better readability corresponds to better correctness.
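For example, a minimal way to get such a score with pytesseract is to average the per-word confidences Tesseract reports (whether higher confidence really tracks correctness is exactly the assumption above):

```python
# Mean per-word confidence as a ground-truth-free "recognition score",
# assuming pytesseract; -1 entries are boxes with no recognised text.
import cv2
import pytesseract
from pytesseract import Output

def mean_confidence(image):
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0

gray = cv2.imread("receipt_photo.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
print(mean_confidence(gray))
```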


You might also extract some global image characteristics such as contrast, signal-to-noise ratio, sharpness, character size and density... and optimize the readability as above. Then feed this information to a classifier that will learn how to handle the different image conditions. Honestly, I don't really believe in this approach.
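A rough sketch of a few such characteristics, assuming OpenCV (RMS contrast and variance-of-Laplacian sharpness are common, but by no means canonical, choices):

```python
# A few global image characteristics, assuming OpenCV; these are simple
# illustrative measures, not a validated quality model.
import cv2

def image_stats(gray):
    return {
        "mean_brightness": float(gray.mean()),
        "rms_contrast": float(gray.std()),
        # variance of the Laplacian: a widely used (if crude) focus/sharpness measure
        "sharpness": float(cv2.Laplacian(gray, cv2.CV_64F).var()),
    }

gray = cv2.imread("invoice_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
print(image_stats(gray))
```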

  • You do me a disservice to interpret my goal as seeking a _"universal image improvement technique"_. However, P4 is interesting. What exactly is measured? The key measures (resolution, contrast, sharpness, geometry) that lead to good OCR outcomes are too abstract on their own. So, edit distance. Now here is my naïveté, because my OP didn't dissuade readers from assuming I was performing state-of-the-art model training. No. More basic. I'm looking for "opinionated" measures of OCR image quality. Naively, I suppose a histogram of values, in relation to positively identified characters, is one measure. – xtian Jan 16 '22 at 20:12
  • @xtian: edit distance is only defined when you have the ground truth, i.e. when you don't need to perform OCR at all. –  Jan 16 '22 at 20:43