I'm testing various Python image pre-processing pipelines for tesseract-ocr.
My input data are PDF invoices and receipts of widely varying quality, from scanned documents (best) to mobile-phone photos taken in poor lighting (worst), and everything in between. When scanning manually for OCR, I typically choose among several scanning presets (unsharp mask, edge fill, color enhance, gamma), and I'm thinking about implementing a similar preset-based solution in a Python pipeline, as sketched below.
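Something like this is what I have in mind (a Pillow-based sketch; the preset names and parameter values are my own placeholders, not tuned defaults):

```python
from PIL import Image, ImageFilter, ImageEnhance

def apply_gamma(img: Image.Image, gamma: float) -> Image.Image:
    """Gamma-correct a grayscale image via a 256-entry lookup table."""
    lut = [int(255 * (i / 255) ** (1 / gamma)) for i in range(256)]
    return img.point(lut)

def preprocess(path: str, preset: str = "unsharp") -> Image.Image:
    img = Image.open(path).convert("L")  # grayscale before OCR
    if preset == "unsharp":
        # Radius/percent/threshold are placeholder values
        img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
    elif preset == "gamma":
        img = apply_gamma(img, gamma=1.4)
    elif preset == "enhance":
        # Stand-in for a scanner's "color enhance": boost contrast
        img = ImageEnhance.Contrast(img).enhance(1.5)
    return img
```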
I understand the standard metric for OCR quality is Levenshtein (edit) distance, which measures the OCR output against ground truth.
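For concreteness, this is the kind of scoring I'd run against ground truth (a self-contained dynamic-programming Levenshtein, plus the derived character error rate):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(ocr_text: str, ground_truth: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

# e.g. cer("invo1ce", "invoice") -> 1/7, roughly 0.14; lower is better
```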
What I'm after are measurements of image-processing effects on OCR result quality. For example, in the paper Prediction of OCR Accuracy, the author describes at least two such measurements: White Speckle Factor (WSF) and Broken Character Factor (BCF). Other descriptors I've read about include salt-and-pepper noise and aberrant pixels.
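To make that concrete, here is the sort of rough proxy I'm imagining, not the paper's exact WSF/BCF definitions: count tiny connected components in a binarized page with OpenCV (the area threshold is a placeholder that would need tuning per scan resolution):

```python
import cv2
import numpy as np

def speckle_ratio(gray: np.ndarray, max_area: int = 4) -> float:
    """Fraction of connected components that are speckle-sized blobs."""
    # Otsu-binarize with inversion so ink is foreground for labeling
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    areas = stats[1:, cv2.CC_STAT_AREA]  # skip label 0 (background)
    if len(areas) == 0:
        return 0.0
    return float(np.count_nonzero(areas <= max_area)) / len(areas)

# gray = cv2.imread("receipt.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
# print(speckle_ratio(gray))
```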
I've worked my way through 200 of the nearly 4k tesseract-tagged questions here. Very interesting. Most are of the form "I have this kind of image; how can I improve the OCR outcome?" Nothing so far about measuring the image-processing effect on OCR outcomes.
A curious one was Dirty Image Quality Assessment Measure, but that question is not focused on OCR and the solutions seem like overkill.