I'm trying to read the text in this image, which also contains decimal numbers:

[image]

in this way:

import cv2
import pytesseract

img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

and what I get is:

73-82
Primo: 50 —

I've also tried specifying the Italian language, but the result is much the same:

73-82 _
Primo: 50

Searching through other questions on Stack Overflow, I found that the reading of decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea how to improve the reading of decimal numbers?
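
For reference, a whitelist does not have to be digits only. The sketch below builds a config string that keeps letters, digits and a few symbols, and passes it together with a language code. The helper names build_tesseract_config and read_text are hypothetical, lang="ita" assumes the Italian language data is installed, and depending on the Tesseract version and engine the whitelist may be ignored, so treat this as something to try rather than a guaranteed fix.

```python
import string


def build_tesseract_config(extra_chars=".-:", psm=3):
    """Build a config string whitelisting ASCII letters, digits and extra_chars.

    psm=3 is Tesseract's default page segmentation mode.
    The whitelist must not contain spaces, since the config string is split on them.
    """
    whitelist = string.ascii_letters + string.digits + extra_chars
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def read_text(path_to_image, lang="ita"):
    """Hypothetical helper: OCR an image file with the whitelist config above."""
    # Imported here so the config helper can be used without OpenCV/Tesseract.
    import cv2
    import pytesseract

    img = cv2.imread(path_to_image)
    if img is None:
        raise FileNotFoundError(path_to_image)
    return pytesseract.image_to_string(
        img, lang=lang, config=build_tesseract_config()
    )
```

Calling read_text(path_to_image) would then return the recognized text as a string.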

marco

2 Answers

I would suggest passing Tesseract each row of text as a separate image.
For some reason, it seems to solve the decimal point issue...

  • Convert the image from grayscale to black and white using cv2.threshold.
  • Apply the cv2.dilate morphological operation with a very long horizontal kernel (this merges the text blocks along the horizontal direction).
  • Use cv2.findContours - each merged row ends up in a separate contour.
  • Find the bounding boxes of the contours.
  • Sort the bounding boxes by their y coordinate.
  • Iterate over the bounding boxes, and pass the slices to pytesseract.

Here is the code:

import numpy as np
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows

path_to_image = 'image.png'

img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale

# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))


# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]  # Use index [-2] to be compatible to OpenCV 3 and 4

# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]

# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])


# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b

    if (h > 10) and (w > 10):
        # Crop a slice, and inverse black and white (tesseract prefers black text).
        slice = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]

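        # Allow letters, digits and the symbols "-", ":" and "."; use page segmentation mode 3.
        config = ("-c tessedit_char_whitelist="
                  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                  " --psm 3")
        text = pytesseract.image_to_string(slice, config=config)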

        print(text)

I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.


Results:

Thresholded image after dilation:
[image]

First slice:
[image]

Second slice:
[image]

Third slice:
[image]

Output text:

7.3-8.2

Primo:50

fmw42
Rotem
  • Thank you, this is really awesome since it also lets me remove external noise from the image. I was wondering whether it also works on images with more than 3 rows, since I'm working with a bunch of images and most of them have fewer or more than 3 rows – marco Mar 08 '21 at 08:38
  • Sure, it's going to work with more than 3 rows, but there are many cases where the solution is not going to work. Examples: I used some specific constant values that don't apply to all cases. It's not going to work when the text is not arranged in horizontal rows. There are also Tesseract limitations - I don't know in which cases Tesseract OCR fails to recognize a decimal point... – Rotem Mar 08 '21 at 11:01
  • I tried it and it works fine with more or fewer rows, but in some cases Tesseract fails to recognize the text when given a single row rather than the entire caption – marco Mar 08 '21 at 11:28
  • You are going to have to debug it. You may start by displaying each "slice" that is passed to `pytesseract.image_to_string`. Add `cv2.imshow('slice', slice)` and `cv2.waitKey()` (add them before `pytesseract.image_to_string`). Make sure the "slice" image contains the expected text. You may also try down-sampling, like Ahx suggested. – Rotem Mar 08 '21 at 11:40
  • Yes, I did, but for some reason if a slice contains "Si: 10", Tesseract reads it as an empty string. It is hard to explain without showing a graphical example, so maybe I'll open a new question. On the other hand, your code works really well – marco Mar 08 '21 at 12:08
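
The debugging advice in the comments above can be sketched roughly as follows. The helper names crop_rows and debug_rows are hypothetical and not part of the answer's code. crop_rows repeats the answer's padded cropping and size check as a function. debug_rows shows each slice and prints Tesseract's output for page segmentation mode 3 (the one used in the answer) and mode 7, which treats the image as a single text line. Whether mode 7 helps with rows such as "Si: 10" is an assumption to test, not a confirmed fix.

```python
import numpy as np


def crop_rows(thresh, boxes, pad=10, min_size=10):
    """Return padded, inverted crops (black text on white), sorted top to bottom.

    thresh is the inverted binary image (white text on black);
    boxes are (x, y, w, h) tuples as returned by cv2.boundingRect.
    """
    height, width = thresh.shape[:2]
    crops = []
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        if w <= min_size or h <= min_size:
            continue  # skip small specks, same size check as in the answer
        top, bottom = max(y - pad, 0), min(y + h + pad, height)
        left, right = max(x - pad, 0), min(x + w + pad, width)
        crops.append(255 - thresh[top:bottom, left:right])
    return crops


def debug_rows(crops, psm_modes=(3, 7)):
    """Show each crop, then print the OCR result for each page segmentation mode."""
    # Imported here because they are only needed for the interactive part.
    import cv2
    import pytesseract

    for i, crop in enumerate(crops):
        cv2.imshow(f"slice {i}", crop)
        cv2.waitKey()  # press any key to continue
        for psm in psm_modes:
            text = pytesseract.image_to_string(crop, config=f"--psm {psm}")
            print(f"slice {i}, psm {psm}: {text!r}")
    cv2.destroyAllWindows()
```

With the variables from the answer, this would be called as debug_rows(crop_rows(thresh, bounding_boxes)). Printing with !r makes an empty result visible as ''.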

You can recognize the text easily by down-sampling the image.

If you down-sample by a factor of 0.5, the result is:

[image]

Reading it now gives:

7.3 - 8.2
Primo: 50

I got this result using pytesseract version 0.3.7 (the current version).

Code:


# Load the libraries
import cv2
import pytesseract

# Load the image
img = cv2.imread("s9edQ.png")

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)

# OCR
txt = pytesseract.image_to_string(gry)
print(txt)

Explanation:


The input image contains a small artifact, which you can see on its right side. The down-sampled image, on the other hand, is well suited to OCR. You need preprocessing when the content of the image is not clearly visible or is corrupted.

Ahmet
  • Thank you! I up-sampled by 2 before reading the image, since I read that it can improve recognition, but I guess it was too much – marco Mar 08 '21 at 08:41
  • You up-sampled before reading the image? Did you try OCR before up-sampling? – Ahmet Mar 08 '21 at 08:45
  • Yes, but in some cases the up-sampling helped me read the image correctly – marco Mar 08 '21 at 08:51
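
Since the comments report that up-sampling helps on some images and down-sampling on others, one option is to run OCR at several scale factors and compare the outputs. The sketch below is only an illustration. The names scaled_size and ocr_at_scales are hypothetical, the default scale factors are arbitrary, and the interpolation choice (cv2.INTER_AREA for shrinking, cv2.INTER_CUBIC for enlarging) differs from the answer's code, which uses the cv2.resize default.

```python
def scaled_size(shape, scale):
    """Return the (width, height) tuple cv2.resize expects for a scale factor."""
    height, width = shape[:2]
    return max(1, round(width * scale)), max(1, round(height * scale))


def ocr_at_scales(gray, scales=(0.5, 1.0, 2.0), config=""):
    """Run Tesseract on resized copies of a grayscale image; return {scale: text}."""
    # Imported here so scaled_size can be used without OpenCV/Tesseract installed.
    import cv2
    import pytesseract

    results = {}
    for scale in scales:
        # INTER_AREA is generally preferred for shrinking, INTER_CUBIC for enlarging.
        interp = cv2.INTER_AREA if scale < 1 else cv2.INTER_CUBIC
        resized = cv2.resize(
            gray, scaled_size(gray.shape, scale), interpolation=interp
        )
        results[scale] = pytesseract.image_to_string(resized, config=config)
    return results
```

Calling ocr_at_scales(gry) on the grayscale image from the answer returns one string per scale, which can then be inspected by hand. No automatic way of picking the best result is implied here.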