I have the following code, and the problem is that for some images the return value is empty. The structure of the images is always the same: plain black text on a white background, clearly readable. 50% of the results are excellent and the rest are just empty.
The only message I get is this warning:
wand/image.py:4623: CoderWarning: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `filename.png' @ warning/png.c/MagickPNGWarningHandler/1747 self.raise_exception()
But this warning appears every time, even when the output is fine.
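# Imports assumed from the aliases used below (wi = Wand image, Image = Pillow)
from io import BytesIO

import pytesseract
from PIL import Image
from wand.image import Image as wi

def retrievetext(self, docname):
    # Download the PDF from the FTP server into memory
    r = BytesIO()
    self.ftp.retrbinary("RETR /httpdocs/" + docname, r.write)
    r.seek(0)
    with wi(file=r, resolution=400) as pdf:
        pdfImage = pdf.convert('png')
        # Crop the same region from every page and keep it as a PNG blob
        imageBlobs = []
        for img in pdfImage.sequence:
            imgPage = wi(image=img)
            imgPage.crop(left=200, top=600, width=1800, height=800)
            imageBlobs.append(imgPage.make_blob('png'))
        # Run OCR (German) on each cropped page, converted to grayscale
        recognized_text = []
        for imgBlob in imageBlobs:
            im = Image.open(BytesIO(imgBlob))
            im = im.convert('L')
            text = pytesseract.image_to_string(im, lang='deu')
            recognized_text.append(text)
    return recognized_text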
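To narrow down which pages come back empty, a small helper along these lines might be useful. This is only a sketch: `empty_page_indices` is a hypothetical name that is not part of the code above, it assumes `recognized_text` is the list of strings returned by `retrievetext`, and the sample list is made-up data rather than real OCR output.

```python
def empty_page_indices(recognized_text):
    """Return the 0-based indices of pages whose OCR text is empty or only whitespace."""
    return [i for i, text in enumerate(recognized_text) if not text.strip()]


# Made-up sample data standing in for the list returned by retrievetext()
sample = ["Rechnung Nr. 1", "", "  \n", "Summe"]
print(empty_page_indices(sample))  # prints [1, 2]
```

With the failing page numbers known, the corresponding cropped images could be saved to disk and compared with the pages that work.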
Does somebody have an idea how to improve the results?
Best regards