
About two years ago, I asked a question here and got a satisfying answer. Thing is, recently the script has been returning a lot of errors, over 30%, so I decided to change my approach and ask a more generic question, working with the original images instead of the processed ones I used in my original question.

Here are the originals:

[12 sample images: slices of the original scanned documents]

As you can see, these examples are slices of the original scanned documents.

The problem lies in their inconsistent quality, both in the original printing and the subsequent scanning. Sometimes the digits stand out, sometimes not. Sometimes I get a darker gray, sometimes a lighter one. Sometimes I get a faulty print, with white lines showing where the printer failed to put ink.

Furthermore, their font is way too "tight": the digits sit very close to each other, sometimes even touching, which precludes me from simply separating each digit in order to clean and OCR it individually.
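
For reference, the naive way to split digits is a vertical projection profile: count the foreground pixels per column and cut at the empty valleys. That is exactly what fails here, since touching digits leave no empty column between them. A minimal sketch (the file name and the valley threshold are placeholders):

import cv2
import numpy as np

# Load a slice and binarize it (digits white on black):
binary = cv2.imread("slice.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(binary, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Foreground pixel count per column:
profile = np.count_nonzero(binary, axis=0)

# Candidate cut columns are the near-empty valleys between digits;
# touching digits produce no valley, so no cut point is found there.
cuts = np.where(profile <= 1)[0]
print(cuts)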

I've tried various approaches with OpenCV, such as various blurs:

import cv2

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # img is the BGR input image
blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # initial cleaning
s_thresh = cv2.threshold(blurred, 120, 255, cv2.THRESH_BINARY_INV)[1]  # fixed threshold
o_thresh = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]  # Otsu
ac_thres = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 5, 10)  # adaptive mean
ag_thres = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 4)  # adaptive Gaussian

And also connected components:

import numpy as np

ret, thresh = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
gray_img = cv2.cvtColor(opening, cv2.COLOR_BGR2GRAY)
_, blackAndWhite = cv2.threshold(gray_img, 127, 255, cv2.THRESH_BINARY_INV)

nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, -1]  # get CC_STAT_AREA component
img2 = np.zeros((labels.shape), np.uint8)

for i in range(0, nlabels - 1):
    if sizes[i] >= 4:  # keep components of at least 4 pixels
        img2[labels == i + 1] = 255

res = cv2.bitwise_not(img2)
gaussian = cv2.GaussianBlur(res, (3, 3), 0)

unsharp_image = cv2.addWeighted(res, 0.3, gaussian, 0.7, 0, res)
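
Note: a conventional unsharp mask weights the original above 1 and subtracts the blur, so the 0.3/0.7 blend above actually softens the image rather than sharpening it. The standard form would be something like this (the 0.5 amount is an arbitrary choice):

# Unsharp mask: res + 0.5 * (res - gaussian), saturated to uint8:
unsharp_image = cv2.addWeighted(res, 1.5, gaussian, -0.5, 0)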

But I still get results that are inconsistent at best.

Should I change my approach? What would you guys recommend?

SteelMasimo
  • _slight_ gaussian blur should subdue the halftoning and moire patterns. then it's just a threshold... as for separating digits: these things are touching. OCR's job is to tolerate that. – Christoph Rackwitz May 16 '22 at 09:26
  • Believe me, it does not tolerate. Sometimes I got a 4 out of a 22 (touching). – SteelMasimo May 17 '22 at 13:22
  • meaning that's the job/responsibility of OCR because only OCR knows what it's supposed to see. any preprocessing _can't_ fix that. I know tesseract sucks. there are freely available alternatives that blow it out of the water (easyocr). – Christoph Rackwitz May 17 '22 at 19:48
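
For completeness, the easyocr route suggested in the comments looks roughly like this (a sketch; the English model is downloaded on first use, and allowlist restricts recognition to digits):

import easyocr

# Build a reader for English text:
reader = easyocr.Reader(['en'])

# readtext returns a list of (bounding box, text, confidence) tuples:
results = reader.readtext("slice.png", allowlist="0123456789")
for bbox, text, confidence in results:
    print(text, confidence)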

1 Answer


Here's a revisited approach to my original answer (now implemented fully in Python!). I'm using the K channel of the CMYK color space to get a binary image. The binary image is obtained via Otsu thresholding plus a little bias; I then apply a minimum-area filter, invert the image and pass it to tesseract.

I'm using a couple of libraries here: imutils for reading the images in a directory, os for joining paths and pytesseract for the OCR. Let's see the code:

# Imports:
import pytesseract  # tesseract (previous installation)
import numpy as np  # numpy
import cv2  # opencv
import os  # os for paths
from imutils import paths

# Image path:
rootDir = "D:"
baseDir = "opencvImages"
subBaseDir = "numbers"

# Otsu bias:
threshBias = 1.2

# Create os-independent path:
path = os.path.join(rootDir, baseDir, subBaseDir)

# Get the test images paths:
imagePaths = sorted(list(paths.list_images(path)))

# Loop over the test images and OCR them:
for imagePath in imagePaths:

    # Load the image via OpenCV:
    currentImage = cv2.imread(imagePath, cv2.IMREAD_COLOR)

    # Show image:
    showImage("Current Image", currentImage)

    # Convert to float and divide by 255:
    imgFloat = currentImage.astype(np.float64) / 255.

    # Calculate channel K:
    kChannel = 1 - np.max(imgFloat, axis=2)

    # Convert back to uint 8:
    kChannel = (255 * kChannel).astype(np.uint8)

    # Threshold via Otsu:
    autoThresh, binaryImage = cv2.threshold(kChannel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Add a little bias and threshold again:
    autoThresh = threshBias * autoThresh
    _, binaryImage = cv2.threshold(kChannel, autoThresh, 255, cv2.THRESH_BINARY)
    showImage("Current Image (Binary)", binaryImage)

    # Apply a filter area of minimum 50 pixels:
    minArea = 50
    binaryImage = areaFilter(binaryImage, minArea)
    showImage("Current Image (Filtered)", binaryImage)

    # Invert Image:
    binaryImage = 255 - binaryImage

    # Setting up tesseract:
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # for Windows
    custom_config = r'--oem 3 --psm 6'
    text = pytesseract.image_to_string(binaryImage, config=custom_config)

    # Show recognized text:
    print("Text is: " + text)

There are a couple of helper functions. showImage is just my custom function to display an image in a window via OpenCV's High-level GUI. After a window pops up, press any key to continue evaluating the script. The areaFilter function is the same function from before; it applies a minimum-area filter to the binary image:

# Defines a re-sizable image window:
def showImage(imageName, inputImage):
    cv2.namedWindow(imageName, cv2.WINDOW_NORMAL)
    cv2.imshow(imageName, inputImage)
    cv2.waitKey(0)

# Applies a minimum blob area filter to an input binary image:
def areaFilter(binaryImage, minArea):
    # Get the stats of every connected component (4-connectivity):
    totalComponents, labeledPixels, componentsStats, componentsCentroids = \
        cv2.connectedComponentsWithStats(binaryImage, connectivity=4)
    # Keep the labels whose area (stat index 4, CC_STAT_AREA) is at least minArea:
    remaining_comp_labels = [i for i in range(1, totalComponents) if componentsStats[i][4] >= minArea]
    # Paint the surviving components white on a black canvas:
    outImage = np.where(np.isin(labeledPixels, remaining_comp_labels), 255, 0).astype(np.uint8)
    return outImage

Let's check out some results. For the first image, this is the K (black) channel only:

This is the pre-filtered binary image (Otsu + bias):

This is the filtered image:

Tesseract returns this:

Text is: 820065084250

The strings Tesseract returned for every image are:

Tesseract OCR
820065084250
930023482930
820065085833
930023485203
820065072022
930023485564
820065084802
820065084691
820065084730
930023445422
820065084551
82006507 1840

Note that the 2 in 820065084551 is successfully recognized, even though the digit is partly cut off. There's white space in the last string, probably because the digits in that image are a little more separated. You can post-process the string to remove it.
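
Removing the white space is a one-liner; assuming text holds the string Tesseract returned:

# Drop all whitespace (spaces, tabs, newlines) from the OCR result:
text = "".join(text.split())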

stateMachine