
I'm trying to convert a list of images I have to text. The images are fairly small but VERY readable (15x160, with only grey text on a white background), yet I can't seem to get pytesseract to read them properly. I tried to increase the size with .resize(), but it didn't seem to do much at all. Here's some of my code. Anything new I can add to increase my chances? Like I said, I'm VERY surprised that pytesseract is failing me here; the text is small but super readable compared to some of the things I've seen it catch.

import urllib.request

import pytesseract
from PIL import Image

for dImg in range(0, len(imgList)):
    url = imgList[dImg]
    local = "img" + str(dImg) + ".jpg"
    urllib.request.urlretrieve(url, local)
    imgOpen = Image.open(local)
    imgOpen.resize((500, 500))
    imgToString = pytesseract.image_to_string(imgOpen)
    newEmail.append(imgToString)
bake

2 Answers


Setting the Page Segmentation Mode (psm) can probably help.

To see all the available psm values, run tesseract --help-psm in your terminal.

Then pick the psm that matches your case. Let's say you want to treat the image as a single text line; in that case your imgToString becomes:

imgToString = pytesseract.image_to_string(imgOpen, config='--psm 7')
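
For reference, a minimal sketch of the whole call, assuming one of the image files from the question's loop has already been downloaded (the filename here is just an example):

import pytesseract
from PIL import Image

imgOpen = Image.open("img0.jpg")  # one of the files saved in the question's loop
# --psm 7: treat the image as a single text line
imgToString = pytesseract.image_to_string(imgOpen, config='--psm 7')
print(imgToString)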

Hope this will help you.

sixela

You can perform several pre-processing steps in your code.

1) Use from PIL import Image and convert the image to grayscale with your_img.convert('L'). There are several other modes and settings you can check (see the sketch after this list).

2) A more advanced method: use a CNN. There are several pre-trained CNNs you can use. Here you can find a bit more detailed information: https://www.cs.princeton.edu/courses/archive/fall00/cs426/lectures/sampling/sampling.pdf
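
For step 1, a rough sketch of the kind of pre-processing that could be tried before handing the image to pytesseract; the scale factor and the threshold of 200 are assumptions, not tuned values, and the filename is just an example:

import pytesseract
from PIL import Image

img = Image.open("img0.jpg")  # one of the downloaded images from the question
gray = img.convert('L')  # convert to grayscale
# resize() returns a new image, so keep the result; upscale the small crop
big = gray.resize((gray.width * 4, gray.height * 4))
# simple binarisation: anything darker than 200 becomes black, the rest white
bw = big.point(lambda p: 0 if p < 200 else 255)
text = pytesseract.image_to_string(bw, config='--psm 7')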

tifi90