Pytesseract doesn't find any text only on some files

Question

I have the following code. The problem is that for some images the return value is empty. The structure of the images is always the same: plain black text on a white background, clearly readable. 50% of the results are excellent; the others are just empty.

The only message I get is this warning:

wand/image.py:4623: CoderWarning: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `filename.png' @ warning/png.c/MagickPNGWarningHandler/1747 self.raise_exception()

But this warning is raised every time, even when the output is fine.

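# Imports assumed from the names used below (not shown in the original post)
from io import BytesIO

import pytesseract
from PIL import Image
from wand.image import Image as wi

def retrievetext(self, docname):
    # Download the PDF from the FTP server into memory
    r = BytesIO()
    self.ftp.retrbinary("RETR /httpdocs/" + docname, r.write)
    r.seek(0)
    # Rasterize the PDF at resolution 400 and convert the pages to PNG
    with wi(file=r, resolution=400) as pdf:
        pdfImage = pdf.convert('png')

    # Crop every page to the region that contains the text
    imageBlobs = []
    for img in pdfImage.sequence:
        imgPage = wi(image=img)
        imgPage.crop(left=200, top=600, width=1800, height=800)
        imageBlobs.append(imgPage.make_blob('png'))

    # Run OCR (German) on each cropped page
    recognized_text = []
    for imgBlob in imageBlobs:
        im = Image.open(BytesIO(imgBlob))
        im = im.convert('L')
        text = pytesseract.image_to_string(im, lang='deu')
        recognized_text.append(text)

    return recognized_text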

Does anybody have an idea how to improve the results?

Best regards

I've found out that reducing the resolution to 150 changes nothing, but if I additionally change the crop size so that it contains a lot more text, it finds everything. When I reduce the crop back to the text position I want, it returns nothing again. – Rune Jan 30 '19 at 07:23
Alderven, you can find an example here: https://ibb.co/j3GJTZw – Rune Jan 30 '19 at 07:27
Works perfectly for me on the sample image, without any manipulation of the image. – Alderven Jan 30 '19 at 07:34
OK, the example image was just an example. All files have the same structure; some work fine, some don't. The example works for me too. So what is the reason for it? I can't share "real examples". – Rune Jan 30 '19 at 07:43

score 1 · Accepted Answer · answered Jan 30 '19 at 09:00

Some of your images are in grayscale mode, so you need to convert them to RGBA format before passing them to pytesseract:

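from PIL import Image
import pytesseract

img = Image.open('example2.png')
# Paste the (possibly grayscale) image onto a new RGBA canvas
rgbimg = Image.new('RGBA', img.size)
rgbimg.paste(img)
text = pytesseract.image_to_string(rgbimg, lang='deu')
print(text)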
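
If it helps, the mode check and the conversion can be wrapped in a small helper. This is only a minimal sketch, not the code from the answer: the name `ensure_rgba` is made up for illustration, and it uses Pillow's `Image.convert('RGBA')` instead of the `Image.new`/`paste` approach, on the assumption that both give an equivalent RGBA image for OCR purposes. The in-memory `gray_page` merely stands in for a real page.

```python
from PIL import Image


def ensure_rgba(img):
    """Return img unchanged if it is already RGBA, else an RGBA copy."""
    if img.mode == "RGBA":
        return img
    # convert() supports common source modes such as "L", "P" and "RGB"
    return img.convert("RGBA")


# Tiny white grayscale image used as stand-in data (hypothetical)
gray_page = Image.new("L", (40, 20), color=255)
rgba_page = ensure_rgba(gray_page)
print(gray_page.mode, "->", rgba_page.mode)
```

The result could then be passed to `pytesseract.image_to_string(rgba_page, lang='deu')` in the same way as `rgbimg`.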

Alderven

OK, to me all the images looked the same?! The results are far better than before. Thank you! One more problem: I crop the image with imgPage.crop(left=200,top=600,width=1800,height=800) and the PNG shows the whole area I need, but tesseract only prints part of it. Changing the crop settings changes the results, even though the PNG still shows the same content, only in a different area of the PNG. – Rune Jan 30 '19 at 12:39
Sounds like you need to ask a new question with an example of the input image and the expected and actual text. – Alderven Jan 30 '19 at 13:01
@Alderven do you know if `pytesseract` only works on color images? – Ricardo Barros Lourenço Dec 11 '19 at 11:57
Thanks... what would be the equivalent in wand, please? – Marcelo Gazzola Jan 26 '20 at 16:22
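
Images that look identical on screen can still be stored in different modes, which may be why the files seemed the same. Below is a rough sketch for checking this, assuming the pages are available as PNG blobs like `imageBlobs` in the question. `summarize_modes` and `_png_blob` are hypothetical names, and the blobs here are generated test data, not real scans.

```python
from collections import Counter
from io import BytesIO

from PIL import Image


def summarize_modes(image_blobs):
    """Count how many of the encoded images fall into each PIL mode."""
    modes = Counter()
    for blob in image_blobs:
        with Image.open(BytesIO(blob)) as im:
            modes[im.mode] += 1
    return modes


def _png_blob(mode):
    """Build a tiny PNG in the given mode (stand-in data for the sketch)."""
    buf = BytesIO()
    Image.new(mode, (8, 8)).save(buf, format="PNG")
    return buf.getvalue()


blobs = [_png_blob("L"), _png_blob("RGB"), _png_blob("L")]
counts = summarize_modes(blobs)
print(counts)
```

If the pages that come back empty all share one mode (for example `L`), that would be consistent with the explanation in the accepted answer.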
