
I am trying to read documents from various sources with Python, using OpenCV and Tesseract. To improve Tesseract's accuracy I do some preprocessing, but unfortunately the documents vary a lot in quality. My current issue is documents that are only partially blurry or shaded due to bad scans.

I have no influence on the document quality, and manual feature detection is not applicable because the code has to run over hundreds of thousands of documents in the end, and even within a single document the quality can vary strongly.

original bad image

To get rid of the shadow, I found a technique of dilating and blurring the image and dividing the original by the dilated version.

import cv2
import numpy as np

# img is the grayscale input image, e.g. img = cv2.imread("scan.png", 0)
h, w = img.shape
kernel = np.ones((7, 7), np.uint8)
dilation = cv2.dilate(img, kernel, iterations=1)            # dilate away the dark text strokes
blurred_dilation = cv2.GaussianBlur(dilation, (13, 13), 0)  # smooth background estimate
resized = cv2.resize(blurred_dilation, (w, h))              # no-op here, dilate/blur keep the size
corrected = np.clip(img / resized * 255, 0, 255).astype(np.uint8)  # divide out the shading

That works very well.

corrected exposure

But the blur is still there, and optically the text has become harder to read. I would like to do a binarisation next, but then nothing valuable would be left of the blurred parts.

I found an example of deconvolution that works for motion blur, but I can only apply it to the whole image, which degrades the rest of the text, and I would need to know the direction of the motion blur. So I hope to get some help on how to optimize this kind of image so that Tesseract can properly read it.
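For illustration, this is roughly what that deconvolution example does: a Wiener filter with a known motion point spread function (PSF). Here I assume a horizontal blur of length 15 and an arbitrary noise constant k; the direction and length are exactly the information I don't have:

import numpy as np

def wiener_deconvolve(img, psf, k=0.01):
    # Wiener deconvolution with a known PSF (sketch, not my production code)
    img = img.astype(np.float64) / 255.0
    psf_pad = np.zeros_like(img)
    ph, pw = psf.shape
    psf_pad[:ph, :pw] = psf                      # pad PSF to image size
    H = np.fft.fft2(psf_pad)                     # transfer function of the blur
    G = np.fft.fft2(img)
    F = np.conj(H) / (np.abs(H) ** 2 + k) * G    # Wiener filter in the frequency domain
    return np.clip(np.abs(np.fft.ifft2(F)) * 255, 0, 255).astype(np.uint8)

# motion-blur PSF: a 1-pixel-wide line, here horizontal, i.e. the
# direction that would have to be known in advance
L = 15
psf = np.zeros((L, L))
psf[L // 2, :] = 1.0 / L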

I know that there are further optimizations to do besides sharpening the blurred text, such as deskewing and removing fragments of adjacent pages, but I am not sure about the proper sequence in which to perform these additional steps.

I can hardly find sources or tutorials on plain document optimization for OCR pipelines. The procedures I do find often apply globally to the whole image or target non-OCR applications.

– Mumpitz
  • deskew: find peaks in the FFT, very robust: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php By the way, the Hough transform is a bad idea: inaccurate or expensive, and also stochastic. – Christoph Rackwitz Jul 01 '21 at 16:20

2 Answers


Reminds me of this article I read a few years ago: https://medium.com/illuin/cleaning-up-dirty-scanned-documents-with-deep-learning-2e8e6de6cfa6

Contrary to the title, it contains a variety of classic computer vision algorithms for your inspiration.

  • To remove shadows, I've personally had median filtering as described there (removing a median-filtered background) work more effectively than what you show here; see the sketch after this list.
  • To deskew, I've experimented with the Hough transform and got good results; that is sketched below as well.
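If it helps, here is a minimal sketch of both ideas, assuming a grayscale scan. The file name, the median kernel size (51), and the Canny/Hough parameters are placeholder values to tune; I remove the background by division here, subtraction is another option:

import cv2
import numpy as np

img = cv2.imread("document.png", 0)               # hypothetical file name

# shadow removal: divide by a median-filtered background estimate
background = cv2.medianBlur(img, 51)              # large odd kernel spans the text strokes
flat = cv2.divide(img, background, scale=255)

# deskew: estimate the dominant text-line angle with a Hough transform
edges = cv2.Canny(flat, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=img.shape[1] // 4, maxLineGap=20)
angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
          for x1, y1, x2, y2 in lines[:, 0]]
skew = np.median([a for a in angles if abs(a) < 45])  # keep near-horizontal lines

# rotate about the center to level the text lines
h, w = flat.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
deskewed = cv2.warpAffine(flat, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)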

Intuitively, if you know the font type and size in advance, that should help as well.

– cidermole
import cv2
import numpy as np
import skimage.filters as filters

# read the image as grayscale
img = cv2.imread("input/ocr.png", 0)

# estimate the background with a large Gaussian blur
blur = cv2.GaussianBlur(img, (91, 91), 0)

# divide gray by the blurred background to remove shading
division = cv2.divide(img, blur, scale=255)

# sharpen using unsharp masking
sharp = filters.unsharp_mask(division, radius=11, amount=11, preserve_range=False)
sharp = (255 * sharp).clip(0, 255).astype(np.uint8)

# threshold with Otsu's method
thresh = cv2.threshold(sharp, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# save results
cv2.imwrite('receipt_division_sharp.png', sharp)
cv2.imwrite('receipt_division_thresh.png', thresh)

result, result with threshold

method: unsharp_mask filter, Otsu's method (1979)

ref: OpenCV: Contour detection of shadowed image before OCR (2020 stack overflow)
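To close the loop with Tesseract, the binarised result can be passed straight to pytesseract; this is a hypothetical usage example, and --psm 6 assumes a single uniform block of text:

import pytesseract

# thresh is the binarised image from the pipeline above
text = pytesseract.image_to_string(thresh, config="--psm 6")
print(text)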

If I were you, I would try a GAN. Even though the raw data is blurred and shadowed, you need clean data for Tesseract, so you would have to generate clean characters from the blurred raw data.