For the specific image you have posted, Tesseract is able to recognize the digits.
- Add a whitelist that restricts recognition to digits: "-c tessedit_char_whitelist=1234567890"
- Add the "--psm 6" argument, which tells Tesseract to assume a single uniform block of text.
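As a minimal sketch, the two options can be composed into the single config string that pytesseract expects. The helper name `build_tesseract_config` is hypothetical and only used for illustration; it is not part of pytesseract.

```python
def build_tesseract_config(whitelist="1234567890", psm=6):
    """Compose the config string for pytesseract.image_to_string.

    Hypothetical helper: it only joins the two options named above.
    """
    return f"-c tessedit_char_whitelist={whitelist} --psm {psm}"


config = build_tesseract_config()
print(config)  # -c tessedit_char_whitelist=1234567890 --psm 6
```

The resulting string would be passed as the `config=` argument of `pytesseract.image_to_string`.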
Tesseract manages to identify the text without cleaning:
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" # For Windows OS
# Read input image
img = cv2.imread("num30.jpg")
# Apply OCR
data = pytesseract.image_to_string(img, config="-c tessedit_char_whitelist=1234567890 --psm 6")
print(data)
Output:
30
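The string returned by the OCR call usually carries trailing whitespace (a newline, and often a form feed), so it may be worth normalizing before use. The sketch below assumes the result should be a single integer; `parse_ocr_digits` is a hypothetical helper, not a pytesseract function.

```python
def parse_ocr_digits(raw):
    """Return the OCR result as an int, or None if it is not purely digits.

    Hypothetical helper: assumes the image contains one number.
    """
    text = raw.strip()  # drop surrounding whitespace such as "\n" or "\x0c"
    return int(text) if text.isdecimal() else None


print(parse_ocr_digits("30\n\x0c"))  # 30
print(parse_ocr_digits(""))          # None
```

Returning None instead of raising keeps the caller in control of what to do when the OCR result is empty or contains stray characters.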
Example for cleaning up the input image:
Assumptions:
- The dark pixels next to the image borders are not part of the text.
- Small clusters are noise.
The cleaning process applies two stages:
- Iterate over the topmost row, bottom row, leftmost column and rightmost column, and apply cv2.floodFill (fill with white) wherever a border pixel is dark.
- Find small clusters using cv2.connectedComponentsWithStats, and fill the small clusters with white color.
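To make the first stage concrete, here is a dependency-free sketch of the idea on a tiny grid of pixel values. It is a simplification, not what cv2.floodFill does internally: it uses a fixed darkness threshold and 4-connectivity, whereas cv2.floodFill compares pixels using loDiff/upDiff. The function name `fill_dark_border_regions` is hypothetical.

```python
from collections import deque


def fill_dark_border_regions(gray, dark_below=200):
    """Whiten every dark region that touches the image border.

    `gray` is a list of rows of 0-255 ints (a stand-in for a grayscale
    image). Assumption: "dark" means value < dark_below (at most 255).
    """
    assert dark_below <= 255  # filled pixels (255) must count as light
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]  # work on a copy

    # Seed with every border pixel; light ones are skipped in the loop.
    queue = deque()
    for x in range(width):
        queue.extend([(0, x), (height - 1, x)])
    for y in range(height):
        queue.extend([(y, 0), (y, width - 1)])

    while queue:
        y, x = queue.popleft()
        if not (0 <= y < height and 0 <= x < width):
            continue  # outside the image
        if out[y][x] >= dark_below:
            continue  # light pixel, or already filled
        out[y][x] = 255
        queue.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return out


image = [
    [0,   255, 255, 255, 255],
    [0,   255, 0,   255, 255],
    [255, 255, 0,   255, 255],
    [255, 255, 255, 255, 255],
]
cleaned = fill_dark_border_regions(image)
```

In this example the dark pixels in the left column touch the border and are whitened, while the dark pixels in the interior (the "text") are left untouched.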
Code sample:
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" # For Windows OS
img = cv2.imread("num30.jpg") # Read input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # Convert to grayscale.
for x in range(gray.shape[1]):
    # Fill dark top pixels:
    if gray[0, x] < 200:
        cv2.floodFill(gray, None, seedPoint=(x, 0), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color
    # Fill dark bottom pixels:
    if gray[-1, x] < 200:
        cv2.floodFill(gray, None, seedPoint=(x, gray.shape[0]-1), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color
for y in range(gray.shape[0]):
    # Fill dark left side pixels:
    if gray[y, 0] < 200:
        cv2.floodFill(gray, None, seedPoint=(0, y), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color
    # Fill dark right side pixels:
    if gray[y, -1] < 200:
        cv2.floodFill(gray, None, seedPoint=(gray.shape[1]-1, y), newVal=255, loDiff=3, upDiff=3)  # Fill the background with white color
cv2.imshow('gray after floodFill', gray) # Show image for testing
# Convert to binary and invert polarity
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Find connected components (clusters)
nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, connectivity=8)
# Remove small clusters: with both width <= 10 and height <= 10 (clean small size noise).
for i in range(nlabel):
    if (stats[i, cv2.CC_STAT_WIDTH] <= 10) and (stats[i, cv2.CC_STAT_HEIGHT] <= 10):
        thresh[labels == i] = 0
cv2.imshow('thresh', thresh) # Show image for testing
# Put 255 where thresh is zero.
gray[thresh == 0] = 255
cv2.imshow('gray', gray)
# Apply OCR
data = pytesseract.image_to_string(gray, config="-c tessedit_char_whitelist=1234567890 --psm 6")  # OCR the cleaned image
print(data)
cv2.waitKey()
cv2.destroyAllWindows()
Output image:
