
I am trying to identify single digits in Python with Tesseract.

My code is this:

import numpy as np
from PIL import Image
from PIL import ImageOps
import pytesseract
import cv2

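def predict(imageArray):
    # Point pytesseract at the Tesseract executable (Windows install path)
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    # Despite its name, imageArray must be a file path or file object, since it is passed to Image.open
    newImageArray = Image.open(imageArray)
    # --psm 10: treat the image as a single character; --oem 1: LSTM engine only; the whitelist restricts output to digits
    number = pytesseract.image_to_string(newImageArray, lang='eng', config='--psm 10 --oem 1 -c tessedit_char_whitelist=0123456789')

    return number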

It has no problem recognising this as an 8:

Image1

but it does not recognise this as a 4:

Image2

My images are just digits 0-9.

This is just one example; there are other instances where it struggles to identify "obvious/clear" digits.

Currently the only preprocessing I apply to my starting image, image, is a colour conversion, using the following:

cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

Is there a way I can improve the accuracy? All of my images are clear, computer-typed digits, so I feel the accuracy should be a lot higher than it is.

Adrian W
stgy222

1 Answer


You did not say which Tesseract version and which language model you are using. The best model identifies the '4' in your image without any preprocessing.
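To report those details, something like the following sketch may help. It calls the Tesseract command line with `--version` and `--list-langs` and collects the results. The install path is copied from the question, the helper names `parse_langs` and `describe_install` are made up for illustration, and the parser assumes the first line of the `--list-langs` output is a header with one language code per line after it.

```python
import subprocess

# Assumed install path, copied from the question; adjust for your machine.
TESSERACT_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def parse_langs(list_langs_output):
    """Return the language codes from `tesseract --list-langs` output.

    Assumption: the first line is a header ("List of available
    languages ...") and each following non-empty line is one code.
    """
    lines = [line.strip() for line in list_langs_output.splitlines()]
    return [line for line in lines[1:] if line]


def describe_install(cmd=TESSERACT_CMD):
    """Run the tesseract CLI and return (version banner line, language codes)."""
    version = subprocess.run([cmd, "--version"], capture_output=True, text=True)
    langs = subprocess.run([cmd, "--list-langs"], capture_output=True, text=True)
    # Some builds print to stderr rather than stdout, so accept either stream.
    version_text = version.stdout or version.stderr
    langs_text = langs.stdout or langs.stderr
    first_line = version_text.splitlines()[0] if version_text else ""
    return first_line, parse_langs(langs_text)

# Usage (requires Tesseract to be installed at TESSERACT_CMD):
#   banner, languages = describe_install()
#   print(banner, languages)
```

Posting the banner line and the language list alongside the question makes it much easier to tell whether the installed eng model is the one being discussed here.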

user898678
  • Which is the best model? I am using Python with pytesseract and tesseract-ocr-w64-setup-v5.0.0-alpha.20200223. – stgy222 Mar 01 '20 at 17:16
  • Read the Tesseract docs: https://tesseract-ocr.github.io/tessdoc/ We will not do it for you. – user898678 Mar 09 '20 at 07:32