How can I increase an image's contrast, convert it to grayscale, and then read all characters exactly with PIL and pytesseract?

Question

Please download the attachment here and save it as /tmp/target.jpg.

The image shows the string 0244R. I extract the string with the Python code below:

from PIL import Image
import pytesseract
import cv2
filename = "/tmp/target.jpg"
image = cv2.imread(filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, threshold = cv2.threshold(gray, 55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))

What I get is

0244K

The correct string is 0244R. How can I increase the image's contrast, convert it to grayscale, and then read all characters exactly with PIL and pytesseract? Here is the webpage that generates the image:

http://www.crup.cn/ValidateCode/Index?t=0.14978241776661583

The risk is that people will provide a solution that only works on this image. Do you have the code that generates this image? — Gabriel Devillers, Oct 16 '19 at 11:47
You were shown the process required to perform this type of cleaning the last time you asked this question https://stackoverflow.com/questions/57183997/why-cant-get-string-with-pil-and-pytesseract. It's not a perfect process. — Trenton McKinney, Oct 18 '19 at 06:42
@potential answerers, this is a robot validation for account creation and login at China Renmin University Press http://www.crup.cn/Account/Login I cannot know what the OP intends to use this for, but if you are in China, aiding the OP in circumventing this may not be kosher. — Him, Oct 18 '19 at 17:27
Every day I log in to the website by hand to collect points. I am tired of doing that, so I want to write a program that logs in and collects the points for me. — showkey, Oct 21 '19 at 04:28

score 0 · Answer 1 · answered Dec 05 '20 at 03:15

If you apply adaptive thresholding followed by a bitwise-not operation to the input image, the result will be:

[image: result of adaptive thresholding and bitwise-not]

Now remove the special characters (dot, comma, etc.) from the OCR output:

txt = pytesseract.image_to_string(bnt, config="--psm 6")
res = ''.join(i for i in txt if i.isalnum())
print(res)

Result will be:

O244R
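
The filtering step above relies on str.isalnum(), which also accepts non-ASCII letters and digits. If the expected text is plain ASCII (an assumption about this image, not something the answer states), a stricter filter may be worth trying. A minimal sketch; the helper names and the sample strings are hypothetical:

```python
import string

# Characters we are willing to keep: A-Z, a-z, 0-9 (assumed alphabet).
ASCII_ALNUM = set(string.ascii_letters + string.digits)


def keep_alnum(text):
    """Same filter as in the answer: drop everything str.isalnum() rejects."""
    return "".join(ch for ch in text if ch.isalnum())


def keep_ascii_alnum(text):
    """Stricter variant: keep only ASCII letters and digits."""
    return "".join(ch for ch in text if ch in ASCII_ALNUM)


# Hypothetical raw OCR strings with punctuation, a newline and a form feed.
raw = "O244R.\n\x0c"
print(keep_alnum(raw))             # punctuation and whitespace removed
print(keep_alnum("02é44R"))        # the accented letter survives isalnum()
print(keep_ascii_alnum("02é44R"))  # the accented letter is dropped
```

Neither filter can repair a misread character such as a letter O returned in place of a zero; it only removes characters outside the allowed set.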

Code:

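import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("Aw6sN.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Inverted adaptive mean threshold (block size 23, constant 100)
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 23, 100)

# Invert the thresholded image
bnt = cv2.bitwise_not(thr)

# Run OCR, assuming a single uniform block of text (--psm 6)
txt = pytesseract.image_to_string(bnt, config="--psm 6")

# Keep only letters and digits
res = ''.join(i for i in txt if i.isalnum())
print(res)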
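
To see what the two thresholding calls do together, here is a simplified pure-Python illustration of mean adaptive thresholding. It is a sketch for intuition, not OpenCV's implementation: the function names, the coordinate clamping at the borders, and the toy 5x5 image are all assumptions. Each pixel is compared with the mean of its neighbourhood minus a constant, and the example checks that the inverted threshold followed by a bitwise-not matches the non-inverted threshold on this toy input.

```python
def adaptive_mean_threshold(gray, block_size, c, invert=False, max_value=255):
    """Simplified mean adaptive threshold on a 2-D list of 0..255 ints.

    A pixel counts as "above" when it exceeds
    (mean of its block_size x block_size neighbourhood) - c.
    Borders are handled by clamping coordinates (a simplification).
    """
    height, width = len(gray), len(gray[0])
    half = block_size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_threshold = total / (block_size * block_size) - c
            above = gray[y][x] > local_threshold
            if invert:
                row.append(0 if above else max_value)
            else:
                row.append(max_value if above else 0)
        out.append(row)
    return out


def bitwise_not(img):
    """Invert an 8-bit image stored as a 2-D list."""
    return [[255 - v for v in row] for row in img]


# Toy image: bright background (200) with a dark stroke (20) in the middle column.
gray = [[200, 200, 20, 200, 200] for _ in range(5)]

inverted = adaptive_mean_threshold(gray, block_size=3, c=50, invert=True)
restored = bitwise_not(inverted)
direct = adaptive_mean_threshold(gray, block_size=3, c=50, invert=False)

print(restored == direct)  # True
print(direct[2])           # [255, 255, 0, 255, 255]
```

On this toy input the dark stroke ends up black on a white background. Real images depend heavily on the block size and constant, so those two values are the ones to experiment with.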
