Why is pytesseract not identifying this image?

Question

I am trying to identify single digits in Python with Tesseract.

My code is this:

import numpy as np
from PIL import Image
from PIL import ImageOps
import pytesseract
import cv2

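def predict(imageArray):
    # Point pytesseract at the local Tesseract install (Windows path).
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    # Image.open expects a file path or file object, despite the parameter name.
    newImageArray = Image.open(imageArray)
    # --psm 10: treat the image as a single character; --oem 1: LSTM engine only;
    # the whitelist restricts the output to digits.
    number = pytesseract.image_to_string(newImageArray, lang='eng', config='--psm 10 --oem 1 -c tessedit_char_whitelist=0123456789')

    return number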

It has no problem saying this is an 8

but it does not recognise this as a 4

My images are just digits 0-9.

This is just one example; there are other instances where it struggles to identify "obvious/clear" digits.

Currently the only thing I am doing to my starting image, image, is converting the colour, using the following:

cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

Is there a way I can improve the accuracy? All of my images are clear, computer-typed images, so I feel the accuracy should be a lot higher than it is.

score 0 · Answer 1 · answered Mar 01 '20 at 16:52

You did not say which Tesseract version and language model you are using. The best model identifies the '4' in your image without any preprocessing.
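To report the version and try a different model, something along these lines may help. It is only a rough sketch: build_config and describe_install are hypothetical helper names, get_languages is only available in recent pytesseract releases, and the --tessdata-dir value is whatever directory holds the .traineddata files you want to test (for example a download of the tessdata_best models).

```python
def build_config(tessdata_dir=None, psm=10, oem=1, whitelist="0123456789"):
    """Assemble a Tesseract config string like the one in the question.

    tessdata_dir is optional (hypothetical location chosen by the caller);
    when given, Tesseract loads its .traineddata files from that directory
    instead of the default one.
    """
    parts = []
    if tessdata_dir is not None:
        # Double quotes keep paths that contain spaces in one piece.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    parts.append(f"--psm {psm}")
    parts.append(f"--oem {oem}")
    parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def describe_install():
    """Return the Tesseract version string and the languages it can find."""
    # Imported here so build_config() can be used without pytesseract installed.
    import pytesseract

    version = pytesseract.get_tesseract_version()
    languages = pytesseract.get_languages(config="")
    return str(version), languages
```

The string returned by build_config(tessdata_dir=...) can be passed as the config argument of image_to_string in place of the hard-coded one, and the output of describe_install() is the kind of information worth adding to the question.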

user898678


Which is the best model? I am using Python with pytesseract and tesseract-ocr-w64-setup-v5.0.0-alpha.20200223. – stgy222 Mar 01 '20 at 17:16
Read the Tesseract docs (https://tesseract-ocr.github.io/tessdoc/). We will not do it for you. – user898678 Mar 09 '20 at 07:32
