Please download the attachment here and save it as /tmp/target.jpg.

[image: the captcha, showing 0244R]
You can see that the image contains 0244R. I extract the string with the Python code below:

from PIL import Image
import pytesseract
import cv2
filename = "/tmp/target.jpg"
image = cv2.imread(filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, threshold = cv2.threshold(gray,55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))

What I get is

0244K

The correct string is 0244R. How can I increase the contrast, convert the image to grayscale, and then recognize all the characters exactly with PIL and pytesseract? Here is the web page that generates the image:

http://www.crup.cn/ValidateCode/Index?t=0.14978241776661583

showkey
  • The risk is that people will provide a solution that only works on this image. Do you have the code that generates this image? – Gabriel Devillers Oct 16 '19 at 11:47
  • You were shown the process required to perform this type of cleaning the last time you asked this question https://stackoverflow.com/questions/57183997/why-cant-get-string-with-pil-and-pytesseract. It's not a perfect process. – Trenton McKinney Oct 18 '19 at 06:42
  • @potential answerers, this is a robot validation for account creation and login at China Renmin University Press http://www.crup.cn/Account/Login I cannot know what the OP intends to use this for, but if you are in China, aiding the OP in circumventing this may not be kosher. – Him Oct 18 '19 at 17:27
  • @Scott That's a good FYI! – Trenton McKinney Oct 18 '19 at 17:41
  • Every day I log in to the website by hand to collect points. I am tired of doing that, so I want to write a program that logs in and collects the points for me. – showkey Oct 21 '19 at 04:28

1 Answer

If you apply adaptive thresholding followed by a bitwise-not operation to the input image, the result will be:

[image: the thresholded, inverted captcha]
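To make the thresholding step less of a black box, here is a minimal pure-Python sketch of mean adaptive thresholding on a tiny grid. The function name `adaptive_mean_threshold` and the sample data are made up for illustration, and the border clamping and rounding may differ in detail from what OpenCV does internally. It computes the combined effect of `THRESH_BINARY_INV` followed by `bitwise_not`: a pixel stays white unless it is more than `c` darker than its local mean.

```python
def adaptive_mean_threshold(gray, block_size, c):
    """Return 255 where pixel > (local mean - c), else 0 (illustrative only)."""
    h, w = len(gray), len(gray[0])
    r = block_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Clamp coordinates so border pixels reuse the nearest edge value
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            mean = total / (block_size * block_size)
            row.append(255 if gray[y][x] > mean - c else 0)
        out.append(row)
    return out


# A light background (200) with one dark "ink" pixel (20) in the middle
gray = [[200] * 5 for _ in range(5)]
gray[2][2] = 20
binary = adaptive_mean_threshold(gray, block_size=3, c=100)
print(binary[2][2], binary[0][0])  # 0 255
```

With a large constant such as 100, only pixels far darker than their neighbourhood turn black, which is presumably why faint background noise disappears while the character strokes survive.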

Now remove any special characters (dots, commas, etc.) from the OCR output:

txt = pytesseract.image_to_string(bnt, config="--psm 6")
res = ''.join(i for i in txt if i.isalnum())
print(res)

Result will be:

O244R

Code:


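import cv2
import pytesseract

# Load the captcha and convert it to grayscale
img = cv2.imread("Aw6sN.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Adaptive mean threshold (block size 23, constant 100), inverted output
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 23, 100)
# Invert back so the text is dark on a light background
bnt = cv2.bitwise_not(thr)
# --psm 6: treat the image as a single uniform block of text
txt = pytesseract.image_to_string(bnt, config="--psm 6")
# Keep only letters and digits
res = ''.join(i for i in txt if i.isalnum())
print(res)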
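
Note that the output shown above starts with the letter O rather than the digit 0. If you know which positions of the code must be digits, a small post-processing step can correct such look-alikes. The sketch below is only an illustration: the helper `clean_ocr_text`, the look-alike map, and the assumption that the first four characters are digits are all hypothetical and not confirmed by the site that generates the image.

```python
# Hypothetical look-alike map; extend it with the confusions you actually observe
LETTER_TO_DIGIT = {"O": "0", "I": "1", "S": "5", "B": "8"}


def clean_ocr_text(raw, digit_positions=()):
    """Drop non-alphanumerics, then map look-alike letters to digits at the given positions."""
    chars = [c for c in raw if c.isalnum()]
    for pos in digit_positions:
        if pos < len(chars):
            chars[pos] = LETTER_TO_DIGIT.get(chars[pos], chars[pos])
    return "".join(chars)


# Assumption: the first four characters of the code are digits
print(clean_ocr_text("O244R.\n", digit_positions=range(4)))  # 0244R
```

This does not help when a letter is misread as another letter (such as R read as K in the question); that case still depends on the preprocessing.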
Ahmet