
I am trying to write Python code that replicates my manual image preprocessing and recognition workflow using Tesseract-OCR.

Manual process:
For manually recognizing text from a single image, I preprocess the image using GIMP and create a TIF image. Then I feed it to Tesseract-OCR, which recognizes it correctly.

To preprocess the image using GIMP, I do:

  1. Change mode to RGB / Grayscale
    Menu -- Image -- Mode -- RGB
  2. Thresholding
    Menu -- Tools -- Color Tools -- Threshold -- Auto
  3. Change mode to Indexed
    Menu -- Image -- Mode -- Indexed
  4. Resize / Scale to Width > 300px
    Menu -- Image -- Scale image -- Width=300
  5. Save as Tif
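For context, GIMP's "Auto" threshold is an Otsu-style method that picks the cutoff maximizing between-class variance. A minimal pure-Python sketch of Otsu's criterion (illustrative only; the pixel values below are made up, and GIMP's exact implementation may differ):

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0   # running sum of gray levels in the "background" class
    w_bg = 0       # running pixel count of the "background" class
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy bimodal image: dark ink around 40-50, light paper around 190-200
pixels = [40] * 30 + [50] * 20 + [190] * 20 + [200] * 30
print(otsu_threshold(pixels))  # picks a cutoff between the two clusters
```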

Then I feed it to Tesseract:

$ tesseract captcha.tif output -psm 6

And I get an accurate result all the time.

Python code:
I have tried to replicate the above procedure using OpenCV and Tesseract:

import cv2

def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    # IMREAD_GRAYSCALE replaces the old CV_LOAD_IMAGE_GRAYSCALE flag (removed in OpenCV 3)
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
    # With THRESH_OTSU the 128 is ignored and Otsu's method picks the threshold;
    # im_bw is already the binarized image, so no second threshold call is needed
    (thresh, im_bw) = cv2.threshold(im_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Note: saving as JPEG reintroduces gray compression artifacts; PNG keeps the image binary
    cv2.imwrite(binary_image_path, im_bw)

    return binary_image_path

from PIL import Image

def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)

    im_bin = Image.open(bin_image_path)
    basewidth = 300  # target width in pixels, matching the GIMP scaling step
    wpercent = basewidth / float(im_bin.size[0])
    hsize = int(float(im_bin.size[1]) * wpercent)
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)

    # Save the upscaled image as TIF to match the manual GIMP workflow
    # (Tesseract also accepts PNG/JPEG, so TIF is not strictly required)
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)

    return tif_file

from pytesseract import image_to_string

def get_captcha_text_from_captcha_image(captcha_path):
    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

    # Perform OCR using the tesseract-ocr library
    image = Image.open(tif_file)
    ocr_text = image_to_string(image, config="-psm 6")
    # Keep only alphanumeric characters from the recognized text
    alphanumeric_text = ''.join(e for e in ocr_text if e.isalnum())

    return alphanumeric_text

But I am not getting the same accuracy. What did I miss?

Update 1:

  1. Original image
    [image]
  2. TIF image created using GIMP
    [image]
  3. TIF image created by my Python code
    [image]

Update 2:

This code is available at https://github.com/hussaintamboli/python-image-to-text

Hussain
  • Try comparing the output of your Python script and GIMP at various stages, e.g. comparing the binary outputs. – ZdaR Sep 09 '15 at 08:17
  • I can see that the TIFs don't look the same – Hussain Sep 09 '15 at 08:25
  • Then probably there is a problem with your thresholding procedure; you need to analyse how the auto thresholding in GIMP actually works in the backend. Can you attach the necessary images to the question? – ZdaR Sep 09 '15 at 08:39
  • The only difference between GIMP and your Python implementation is an extra border added in the Python image, and in the GIMP output the strokes of the text are quite smooth. I would suggest getting rid of the extra border. – ZdaR Sep 09 '15 at 08:58
  • Yes. I can see that. I'll try to remove those strokes. I don't think the border is causing any issue because the text that python code has recognized is only one char wrong. Can you give some more hints or code snippets? – Hussain Sep 09 '15 at 09:06
  • Which character is mismatched, precisely ? – ZdaR Sep 09 '15 at 09:30
  • It gives the output - 88BC'7F. (Note the extra single quote from the recognized text) – Hussain Sep 09 '15 at 11:35
  • You may try techniques such as `erosion` and `dilation` to fill up the holes and to remove the small black dots, respectively. – ZdaR Sep 10 '15 at 14:00
  • Update: Please check https://github.com/hussaintamboli/python-image-to-text – Hussain Feb 12 '18 at 08:53
  • Wow, looks like you are trying to write a program to create a robot when recaptcha is used. – Eamonn Kenny Apr 17 '18 at 14:59
  • @Hussain have you had issues with bounding boxes? – Godfather Jul 30 '19 at 20:56
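The erosion/dilation idea from the comments can be sketched on a tiny binary grid (pure Python for illustration; on real images one would use `cv2.erode`/`cv2.dilate` or `cv2.morphologyEx`, and the 3x3 neighbourhood here is deliberately simple):

```python
# 1 = black ink, 0 = white background, as in the thresholded captcha.

def dilate(grid):
    """Grow ink regions by one pixel (3x3 neighbourhood)."""
    h, w = len(grid), len(grid[0])
    return [[1 if any(grid[y][x]
                      for y in range(max(0, r - 1), min(h, r + 2))
                      for x in range(max(0, c - 1), min(w, c + 2)))
             else 0 for c in range(w)]
            for r in range(h)]

def erode(grid):
    """Shrink ink regions by one pixel (3x3 neighbourhood)."""
    h, w = len(grid), len(grid[0])
    return [[1 if all(grid[y][x]
                      for y in range(max(0, r - 1), min(h, r + 2))
                      for x in range(max(0, c - 1), min(w, c + 2)))
             else 0 for c in range(w)]
            for r in range(h)]

noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],  # one-pixel white hole inside a stroke
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],  # isolated black speck
]

# Closing (dilate then erode) fills the hole; opening (erode then dilate)
# removes the speck. On a grid this small the 3x3 kernel is aggressive and
# also erases the thin stroke, so real captchas need strokes thicker than
# the structuring element.
closed = erode(dilate(noisy))
opened = dilate(erode(noisy))
```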

2 Answers


If the output deviates only minimally from the expected output (e.g. an extra ' or " character, as suggested in your comments), try limiting recognition to the character set you expect (e.g. alphanumeric).
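A sketch of how that could look with pytesseract (the hexadecimal whitelist is an assumption based on the "88BC7F" example in the question; note that the Tesseract 4.0 LSTM engine ignores `tessedit_char_whitelist`, while the legacy engine and 4.1+ honour it):

```python
def build_tesseract_config(whitelist, psm=6):
    # --psm 6 treats the image as a single uniform block of text;
    # tessedit_char_whitelist restricts which characters may be output.
    return "--psm {} -c tessedit_char_whitelist={}".format(psm, whitelist)

def ocr_with_whitelist(image_path, whitelist="0123456789ABCDEF"):
    # Imported lazily so build_tesseract_config works without pytesseract installed.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path),
                                       config=build_tesseract_config(whitelist))
```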

Aurelian
  • I tried this a long time ago. You can check my code: https://github.com/hussaintamboli/python-image-to-text. I'll be happy to merge any improvements if you raise a PR – Hussain Feb 12 '18 at 08:55

You have already applied simple thresholding. The missing part is that you need to process the characters one by one.

For each single character:

    1. Upsample
    2. Add a border

Upsampling is required for accurate recognition, and adding a border to the image centers the character.

[six per-character crop images]
8 8 B C 7 F

Code:


import cv2
import pytesseract

img = cv2.imread('Iv5BS.jpg')
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's method selects the binarization threshold automatically
thr = cv2.threshold(gry, 128, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

(h_thr, w_thr) = thr.shape[:2]
s_idx = 2                    # left edge of the current character crop
e_idx = int(w_thr/6) - 20    # right edge (six characters across the image)
result = ""

for _ in range(0, 6):
    # Crop the current character, trimming the top and bottom margins
    crp = thr[5:int((6*h_thr)/7), s_idx:e_idx]
    (h_crp, w_crp) = crp.shape[:2]
    # Upsample 2x for more accurate recognition
    crp = cv2.resize(crp, (w_crp*2, h_crp*2))
    # Add a white border to center the character
    crp = cv2.copyMakeBorder(crp, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)
    # Slide the crop window to the next character
    s_idx = e_idx
    e_idx = s_idx + int(w_thr/6) - 7
    txt = pytesseract.image_to_string(crp, config="--psm 6")
    result += txt[0]
    cv2.imshow("crp", crp)
    cv2.waitKey(0)

print(result)

Result:

88BC7F
Ahmet