
I am trying to read documents from various sources with Python, using OpenCV and Tesseract. To improve Tesseract's accuracy I do some preprocessing, but unfortunately the documents vary a lot in quality. My current issue is documents that are only partially blurry or shaded due to bad scans.

I have no influence on the document quality, and manual feature detection is not applicable because the code has to run over hundreds of thousands of documents in the end, and even within a single document the quality can vary strongly.

original bad image

To get rid of the shadow, I found a technique of dilating and blurring the image and dividing the original by the dilated version.

import cv2
import numpy as np

# img is the grayscale input image, e.g. img = cv2.imread("scan.png", 0)
h, w = img.shape
kernel = np.ones((7, 7), np.uint8)
dilation = cv2.dilate(img, kernel, iterations=1)            # dilate away the dark text strokes
blurred_dilation = cv2.GaussianBlur(dilation, (13, 13), 0)  # smooth background estimate
resized = cv2.resize(blurred_dilation, (w, h))              # no-op here, dilate/blur keep the size
corrected = np.clip(img / resized * 255, 0, 255).astype(np.uint8)  # divide out the shading

That works very well.

corrected exposure

But the blur is still there, and optically the text has become harder to read. I would like to do a binarisation next, but then nothing valuable would be left of the blurred parts.

I found an example of deconvolution that works for motion blur, but I can only apply it to the whole image, which degrades the rest of the text, and I would need to know the direction of the motion blur. So I hope to get some help on how to optimize this kind of image so that Tesseract can properly read it.
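For illustration, this is roughly what that deconvolution example does: a Wiener filter with a known motion point spread function (PSF). Here I assume a horizontal blur of length 15 and an arbitrary noise constant k; the direction and length are exactly the information I don't have:

import numpy as np

def wiener_deconvolve(img, psf, k=0.01):
    # Wiener deconvolution with a known PSF (sketch, not my production code)
    img = img.astype(np.float64) / 255.0
    psf_pad = np.zeros_like(img)
    ph, pw = psf.shape
    psf_pad[:ph, :pw] = psf                      # pad PSF to image size
    H = np.fft.fft2(psf_pad)                     # transfer function of the blur
    G = np.fft.fft2(img)
    F = np.conj(H) / (np.abs(H) ** 2 + k) * G    # Wiener filter in the frequency domain
    return np.clip(np.abs(np.fft.ifft2(F)) * 255, 0, 255).astype(np.uint8)

# motion-blur PSF: a 1-pixel-wide line, here horizontal, i.e. the
# direction that would have to be known in advance
L = 15
psf = np.zeros((L, L))
psf[L // 2, :] = 1.0 / L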

I know that there are further optimizations to do besides sharpening the blurred text, such as deskewing and removing fragments of adjacent pages, but I am not sure about the proper sequence in which to perform these additional steps.

I can hardly find sources or tutorials on plain document optimization for OCR pipelines. The procedures I do find often apply globally to the whole image or target non-OCR applications.

– Mumpitz
  • deskew: find peaks in the FFT, very robust: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php By the way, the Hough transform is a bad idea: inaccurate or expensive, and also stochastic. – Christoph Rackwitz Jul 01 '21 at 16:20

2 Answers


Reminds me of this article I read a few years ago: https://medium.com/illuin/cleaning-up-dirty-scanned-documents-with-deep-learning-2e8e6de6cfa6

Contrary to the title, it contains a variety of classic computer vision algorithms for your inspiration.

  • To remove shadows, I've personally had median filtering as described there (removing a median-filtered background) work more effectively than what you show here; see the sketch after this list.
  • To deskew, I've experimented with the Hough transform and got good results; that is sketched below as well.
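If it helps, here is a minimal sketch of both ideas, assuming a grayscale scan. The file name, the median kernel size (51), and the Canny/Hough parameters are placeholder values to tune; I remove the background by division here, subtraction is another option:

import cv2
import numpy as np

img = cv2.imread("document.png", 0)               # hypothetical file name

# shadow removal: divide by a median-filtered background estimate
background = cv2.medianBlur(img, 51)              # large odd kernel spans the text strokes
flat = cv2.divide(img, background, scale=255)

# deskew: estimate the dominant text-line angle with a Hough transform
edges = cv2.Canny(flat, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=img.shape[1] // 4, maxLineGap=20)
angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
          for x1, y1, x2, y2 in lines[:, 0]]
skew = np.median([a for a in angles if abs(a) < 45])  # keep near-horizontal lines

# rotate about the center to level the text lines
h, w = flat.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
deskewed = cv2.warpAffine(flat, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)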

Intuitively, if you know the font type and size in advance, that should help as well.

– cidermole
import cv2
import numpy as np
import skimage.filters as filters

# read the image as grayscale
img = cv2.imread("input/ocr.png", 0)

# estimate the background with a large Gaussian blur
blur = cv2.GaussianBlur(img, (91, 91), 0)

# divide gray by the blurred background to remove shading
division = cv2.divide(img, blur, scale=255)

# sharpen using unsharp masking
sharp = filters.unsharp_mask(division, radius=11, amount=11, preserve_range=False)
sharp = (255 * sharp).clip(0, 255).astype(np.uint8)

# threshold with Otsu's method
thresh = cv2.threshold(sharp, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# save results
cv2.imwrite('receipt_division_sharp.png', sharp)
cv2.imwrite('receipt_division_thresh.png', thresh)

result, result with threshold

method: unsharp_mask filter, Otsu's method (1979)

ref: OpenCV: Contour detection of shadowed image before OCR (2020 stack overflow)
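To close the loop with Tesseract, the binarised result can be passed straight to pytesseract; this is a hypothetical usage example, and --psm 6 assumes a single uniform block of text:

import pytesseract

# thresh is the binarised image from the pipeline above
text = pytesseract.image_to_string(thresh, config="--psm 6")
print(text)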

If I were you, I would try a GAN. Even though the raw data is blurred and shadowed, you need clean data for Tesseract, so you would have to generate clean characters from the blurred raw data.