0

I'm trying to read the text of a captcha with OCR, but unfortunately the result is not perfect. I'm using pytesseract 0.3.8, Python 3.9 and Tesseract v5.0.0-alpha.20210506 under Windows 10 x64.

Captcha1

Captcha2

Captcha3

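My code:

import os
import pytesseract
from PIL import Image, ImageOps

# Load the captcha and stretch its contrast before OCR
image = Image.open(path).convert('RGB')
image = ImageOps.autocontrast(image)

# fct is a project-local helper module (not shown); creerDossierSiInexistant = "create folder if missing"
fct.creerDossierSiInexistant("captchas")
filename = "{}.png".format(os.getpid())
image.save("captchas\\" + filename)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

text = pytesseract.image_to_string(Image.open("captchas\\" + filename))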

Thanks !

Kin
  • Just making adjustments to the image (like thresholding and sharpening) and then calling tesseract will probably not work. For tasks such as yours, it's better to either train tesseract or apply cv2 methods. As a quick fix, you could also try splitting the characters found in the image and running tesseract on each one. Take a look at https://stackoverflow.com/questions/56305445/how-to-extract-dotted-text-from-image – Kfcaio Jul 16 '21 at 15:39
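
The "split the characters" idea from the comment above can be sketched with a simple vertical projection. This is only a minimal illustration of the generic technique, not the commenter's code: the function names (`binarize`, `split_columns`, `crop`), the fixed threshold of 128 and the toy pixel grid are all assumptions made for the example, and it only works when the glyphs are separated by at least one empty column.

```python
def binarize(gray, threshold=128):
    """Return a 0/1 grid: 1 where a pixel is darker than the threshold (ink)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: number of ink pixels in each column.
    ink = [sum(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, count in enumerate(ink):
        if count and start is None:
            start = x                  # a run of inked columns begins
        elif not count and start is not None:
            spans.append((start, x))   # the run ended at an empty column
            start = None
    if start is not None:
        spans.append((start, width))   # run reaches the right edge
    return spans


def crop(binary, span):
    """Cut one column range out of the grid."""
    start, end = span
    return [row[start:end] for row in binary]


# Hypothetical 3x7 grayscale image (0 = black ink, 255 = white background).
gray = [
    [255, 0, 0, 255, 255, 0, 255],
    [255, 0, 255, 255, 255, 0, 255],
    [255, 0, 0, 255, 255, 0, 0],
]
binary = binarize(gray)
spans = split_columns(binary)
glyphs = [crop(binary, s) for s in spans]
```

With a real image, each crop would be converted back to an image and passed to tesseract on its own, for example with page segmentation mode 10 (`--psm 10`, "treat the image as a single character"). Glyphs that touch or overlap will not be separated by this approach.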

1 Answer

2

Tesseract is not designed to break captchas. It expects clean images with minimal artifacts.

If a captcha is in place, there is a reason for it. Instead of breaking it (and the site's rules), contact the site's admin and agree on some form of cooperation.

user898678