What causes pytesseract to read either the top or bottom text-line of a dual-line image depending on whether opencv or pillow is used?

Question

EDIT: I forgot to process the image which solves the reading issue, thanks to Nathancy. Still wondering what makes Tesseract read only the top OR the bottom line of an unprocessed image (same image, two different outcomes)

Orignal:
I have an image that contains two lines of text: random test image for pytesseract

When I open the image within python (IDLE Python 3.6) with PIL Image and use pytesseract to extract a string, it only extracts the last/bottom line correctly. The upper line of text is scrambled garbage.(see code section below)
However, when I use opencv to open the image and use pytesseract to extract a string, it only extracts the top/upper line correctly whilst making a mess of the second/bottom line of text.(see also code section below)

Here is the code:

>>> from PIL import Image, ImageFilter
>>> import pytesseract
>>> pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
>>> import cv2

>>> img = Image.open(r"C:\Users\user\MyImage.png")
>>> img2 = cv2.imread(r"C:\Users\user\MyImage.png", cv2.IMREAD_COLOR)


>>> print(pytesseract.image_to_string(img2))
Pet Sock has 448/600 HP left
A ae eee PER eats ae

>>> print(pytesseract.image_to_string(img))
Le TL
JHE has 329/350 HP left.

When I use pytesseract.image_to_boxes on both img and img2 it will show the same bounding box for certain locations with a different letter (only showing 2 extracted lines which contain an identical box)

>>> print(pytesseract.image_to_boxes(img2))
A 4 6 10 16 0

>>> print(pytesseract.image_to_boxes(img))
J 4 6 10 16 0

When I use the pytesseract.image_to_data on both img and img2 it shows very high (95+) confidence on the line it reads correctly and very low (30-) on the garbled line.
Excel table output of image_to_data
edit: excel tables are img2 and img accordingly

I fiddled around with the psm config values (I have tried them all) and except for creating more garbage on settings: 5, 7, 8, 9, 10, 13; and some giving an error: 0, 2; it gave no different results than the default (which is 3 I believe)

I must be making some rookie mistake but I can't get my head around why this is happening. If anyone can shine a light in the right direction it would be awesome.

The image was just a fitting, but random, image for an OCR test that I had laying around. No further intentions than experimenting with pytesseract.

score 5 · Answer 1 · answered Nov 12 '19 at 21:47

5

Whenever performing OCR with Pytesseract, it is important to preprocess the image so that the text is in black with the background in white. We can do this with simple thresholding

Output from Pytesseract

Pet Sock has 448/600 HP left
JHE has 329/359 HP left.

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.waitKey()

answered Nov 12 '19 at 21:47

nathancy

42,661
14
115
137

Thanks for the processing advise. I forgot to use that. Still I wonder why Tesseract does the weird splitting thing of reading either the top or the bottom row when the image is unprocessed. – non-english-programmer Nov 13 '19 at 11:03
1

It could be due to undefined behavior when the image is not in the correct state for Pytesseract. To avoid the weird behavior, its good practice to preprocess the image before trying to OCR – nathancy Nov 13 '19 at 21:59

What causes pytesseract to read either the top or bottom text-line of a dual-line image depending on whether opencv or pillow is used?

1 Answers1