Tesseract is detecting 1 as t

Question

I am trying to extract emails from screenshots.

As you can see in this image, there is an email.

This is my code:

import cv2
import pytesseract

image = cv2.imread('image_name.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Otsu threshold with inverted output, then invert again (same result as THRESH_BINARY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
thresh = 255 - thresh
text = pytesseract.image_to_string(thresh, config='--psm 6')

I have tried everything from grayscale conversion to thresholding to inversion, but nothing seems to work.

Earlier, it was detecting 5 as 's' and 1 as 'i'. After pre-processing the image as shown above, the problem with 5 is resolved, but it now detects 1 as 't'. Please help.

I have tried every pre-processing technique I could find.

Edit 1: First of all, I am a complete beginner, so I might say something that sounds naive in the programming world. Please bear with me.

These are some of the results of the image_to_data function on the image: email email string itself & contact yet

I would have posted the result of pre-processing image but it shows this error when I try to run cv2.imshow():

The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function 'cvShowImage'

I am running Jupyter Notebook on Anaconda, which could be the reason for this error.

Here is the image after processing: Image After Processing

It does the best it can, but text recognition is never perfect. — Tim Roberts, Dec 23 '22 at 19:58
Try running the image_to_data function to get better information about each character pytesseract recognizes and its boundaries, then post it here. Also, it wouldn't hurt if you posted some kind of test image after preprocessing, so we can use it locally to help you. — LordNani, Dec 23 '22 at 19:58
I am interested in the results of the image_to_data function: confidences, boundaries, and characters — LordNani, Dec 23 '22 at 19:59
`I would have posted the result of pre-processing image but it shows this error` Note that you can also export the image by using `cv2.imwrite()`. — Nick ODell, Dec 23 '22 at 21:01
Thank you for the advice @LordNani. I ran the image_to_data function and have posted some results in the post above. I couldn't upload the image after processing, as I am unable to see it myself due to the error mentioned above. — ajaygarg, Dec 23 '22 at 21:02
Thanks @NickODell . I am able to get the image. Posting it now. — ajaygarg, Dec 23 '22 at 21:05
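
For anyone following the image_to_data suggestion in the comments, here is a rough sketch of how the output could be inspected. The helper names `low_confidence_words` and `ocr_data` and the cut-off of 80 are my own assumptions for illustration, not part of pytesseract. As far as I know, image_to_data reports words rather than single characters by default, so a misread 1 shows up as a low-confidence word, not as a flagged character.

```python
def low_confidence_words(data, min_conf=80.0):
    """Return (text, confidence, box) for recognised words scoring below min_conf.

    `data` is the dict produced by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    suspects = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        conf = float(conf)  # some pytesseract versions return confidences as strings
        # Entries with conf == -1 are layout boxes (blocks, lines) with no recognised text.
        if text.strip() and 0 <= conf < min_conf:
            suspects.append((text, conf, (left, top, width, height)))
    return suspects


def ocr_data(image, config="--psm 6"):
    """Run Tesseract on an image and return its word-level data as a dict."""
    # Imported here so low_confidence_words() can be used without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    return pytesseract.image_to_data(image, config=config, output_type=Output.DICT)


# Possible usage with the `thresh` image from the question:
#     for word, conf, box in low_confidence_words(ocr_data(thresh)):
#         print(word, conf, box)
```

When cv2.imshow() is not available, as in the notebook setup described in the question, cv2.imwrite('thresh.png', thresh) writes the pre-processed image to disk so it can be opened or uploaded (the file name here is just an example).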

score 1 · Answer 1 · answered Dec 23 '22 at 20:02

I have used Tesseract a fair bit, and the best advice I got was to make the image fully black and white and to sharpen the edges (both of which you can do with OpenCV). You can also train Tesseract on a specific font, so if all the emails use the same font, you can try that.

Also, try to keep the image at its original size; making it bigger reduces its quality. I don't have the means to fully test your image right now, but those two things helped me a lot. Unfortunately, it will never be perfect, and you might have to live with that.
