How to edit the image so Tesseract OCR recognizes it? (Python)

Question

Here is the image I am trying to get Tesseract to detect:

So far, I have tried greyscaling, inverting, blurring, thresholding and still no recognition.
Is there something I'm missing that prevents it from recognizing or is there something (or any resources) that might help me detect it?

Since posting the question, I have tried separating the digits and feeding them to Tesseract, but it still won't detect anything.

It doesn't seem difficult to find a very specific solution that works with the image you have posted. If you are looking for a more general solution, you need to define a set of assumptions we may use, and post a few more input images. Please edit your question and add the source code. - Rotem, Feb 08 '22 at 09:51

Rotem · Answer 1 · 2022-02-08T10:30:23.340

For the specific image you have posted, Tesseract is able to recognize the digits.

Add a whitelist that restricts recognition to digits: -c tessedit_char_whitelist=1234567890
Add the --psm 6 argument, which tells Tesseract to assume a single uniform block of text.
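If you want to experiment with other whitelists or page segmentation modes, the two options can be assembled by a small helper. This is only a sketch: build_tesseract_config is a hypothetical name, not part of pytesseract, and its return value is simply the string you would pass as the config argument.

```python
def build_tesseract_config(whitelist="1234567890", psm=6):
    """Build a Tesseract config string (hypothetical helper, not a pytesseract API).

    whitelist: characters Tesseract is allowed to output.
    psm: page segmentation mode (6 = assume a single uniform block of text).
    """
    return f"-c tessedit_char_whitelist={whitelist} --psm {psm}"


config = build_tesseract_config()
print(config)  # -c tessedit_char_whitelist=1234567890 --psm 6
```

The resulting string would be passed as pytesseract.image_to_string(img, config=config).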

Tesseract manages to identify the text without cleaning:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # For Windows OS

# Read input image
img = cv2.imread("num30.jpg")


# Apply OCR
data = pytesseract.image_to_string(img, config="-c tessedit_char_whitelist=1234567890 --psm 6")

print(data)

Output:

30
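Depending on the Tesseract and pytesseract versions, the returned string may carry trailing whitespace such as a newline or a page-separator character. A minimal sketch for keeping only the digits, assuming a raw value like the one below (keep_digits is a hypothetical helper and the raw string is an assumed example, not captured output):

```python
def keep_digits(ocr_text):
    """Return only the characters 0-9 from raw OCR output (hypothetical helper)."""
    return "".join(ch for ch in ocr_text if ch in "0123456789")


raw = "30\n\x0c"  # assumed example: digits followed by a newline and a form feed
print(keep_digits(raw))  # 30
```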

Example for cleaning up the input image:

Assumptions:

The dark pixels next to the image borders are not part of the text.
Small clusters are noise.

The cleaning process may be applied in two stages:

Iterate over the top row, bottom row, left column and right column, and apply cv2.floodFill (fill with white) wherever a border pixel is dark.
Find small clusters using cv2.connectedComponentsWithStats, and fill the small clusters with white.
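To see what the two stages do without installing anything, here is a dependency-free sketch on a toy grid. It is an illustration of the idea only, not the OpenCV implementation: the helper names (region, clean), the toy image, and the max_size=1 limit are all assumptions made for the example. Border regions are filled with 4-connectivity and clusters are grouped with 8-connectivity, and a cluster counts as noise when both its bounding-box width and height are at most max_size.

```python
from collections import deque

DARK, WHITE = 0, 255
N4 = ((-1, 0), (1, 0), (0, -1), (0, 1))          # 4-connected neighbours
N8 = N4 + ((-1, -1), (-1, 1), (1, -1), (1, 1))   # 8-connected neighbours


def region(grid, start, offsets):
    """Return the set of pixels connected to `start` that share its value."""
    rows, cols = len(grid), len(grid[0])
    value = grid[start[0]][start[1]]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] == value):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def clean(grid, max_size=1):
    """Whiten dark regions touching the border, then whiten small dark clusters."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]

    # Stage 1: flood-fill (with white) every dark region that touches the border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and out[r][c] == DARK:
                for pr, pc in region(out, (r, c), N4):
                    out[pr][pc] = WHITE

    # Stage 2: whiten clusters whose bounding box is at most max_size x max_size.
    visited = set()
    for r in range(rows):
        for c in range(cols):
            if out[r][c] == DARK and (r, c) not in visited:
                cluster = region(out, (r, c), N8)
                visited |= cluster
                height = max(p[0] for p in cluster) - min(p[0] for p in cluster) + 1
                width = max(p[1] for p in cluster) - min(p[1] for p in cluster) + 1
                if width <= max_size and height <= max_size:
                    for pr, pc in cluster:
                        out[pr][pc] = WHITE
    return out


# Toy image: '#' is dark, '.' is white. It contains a border blob (top left),
# a 2x3 "glyph" in the middle, and a single-pixel speck of noise.
toy = [
    "##.....",
    "#......",
    "..##...",
    "..##.#.",
    "..##...",
    ".......",
]
grid = [[DARK if ch == "#" else WHITE for ch in row] for row in toy]
result = ["".join("#" if v == DARK else "." for v in row) for row in clean(grid)]
print("\n".join(result))
```

Only the 2x3 glyph survives: the border blob is removed in the first stage and the speck in the second. On real images, cv2.floodFill and cv2.connectedComponentsWithStats do the same work far more efficiently.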

Code sample:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # For Windows OS


img = cv2.imread("num30.jpg")  # Read input image

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Convert to grayscale.

for x in range(gray.shape[1]):
    # Fill dark top pixels:
    if gray[0, x] < 200:
        cv2.floodFill(gray, None, seedPoint=(x, 0), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color

    # Fill dark bottom pixels:
    if gray[-1, x] < 200:
        cv2.floodFill(gray, None, seedPoint=(x, gray.shape[0]-1), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color

for y in range(gray.shape[0]):
    # Fill dark left side pixels:
    if gray[y, 0] < 200:
        cv2.floodFill(gray, None, seedPoint=(0, y), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color

    # Fill dark right side pixels:
    if gray[y, -1] < 200:
        cv2.floodFill(gray, None, seedPoint=(gray.shape[1]-1, y), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color

cv2.imshow('gray after floodFill', gray)  # Show image for testing

# Convert to binary and invert polarity
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find connected components (clusters)
nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, connectivity=8)

# Remove small clusters: With both width<=10 and height<=10 (clean small size noise).
for i in range(nlabel):
    if (stats[i, cv2.CC_STAT_WIDTH] <= 10) and (stats[i, cv2.CC_STAT_HEIGHT] <= 10):
        thresh[labels == i] = 0

cv2.imshow('thresh', thresh)  # Show image for testing

# Put 255 (white) where thresh is zero.
gray[thresh == 0] = 255

cv2.imshow('gray', gray)

# Apply OCR to the cleaned image (gray), not the original input (img)
data = pytesseract.image_to_string(gray, config="-c tessedit_char_whitelist=1234567890 --psm 6")

print(data)

cv2.waitKey()
cv2.destroyAllWindows()

Output image:
