I am using wand and pytesseract to extract the text of PDFs uploaded to a Django site, like so:
import io

import pytesseract
from PIL import Image as PI
from wand.image import Image

def pdf_to_text(read_pdf_file):
    # Rasterize the whole uploaded PDF at 300 DPI.
    image_pdf = Image(blob=read_pdf_file, resolution=300)
    image_png = image_pdf.convert('png')
    # Re-encode every page as a standalone PNG blob.
    req_image = []
    for img in image_png.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('png'))
    # OCR each page image with Tesseract.
    final_text = []
    for img in req_image:
        txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
        final_text.append(txt)
    return " ".join(final_text)
I have this running under Celery on a separate EC2 server. However, because image_pdf grows to roughly 4 GB even for a 13.7 MB PDF, the worker is being killed by the OOM killer. Instead of paying for more RAM, I want to reduce the memory that wand and ImageMagick use; since the task is already async, I don't mind longer computation times.

I have skimmed http://www.imagemagick.org/Usage/files/#massive, but I am not sure whether those techniques can be applied through wand. Another possible fix would be a way to open the PDF in wand one page at a time rather than pulling the full rasterized document into RAM at once. Alternatively, how could I interface with ImageMagick directly from Python so that I could use these memory-limiting techniques? Sketches of what I am imagining for each approach are below.
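For the memory-limiting idea, this is a minimal sketch of what I am imagining, assuming I am reading the wand docs correctly that wand.resource.limits exposes ImageMagick's resource limits (I believe this needs a fairly recent wand release, which I would have to confirm):

from wand.resource import limits

# Cap the pixel cache's RAM usage (values are in bytes). Past these
# thresholds ImageMagick should spill to memory-mapped files and then
# disk, which is slower but should avoid the OOM killer.
limits['memory'] = 256 * 1024 * 1024  # at most 256 MB of heap
limits['map'] = 512 * 1024 * 1024     # at most 512 MB memory-mapped

If this works, it would only need to run once at worker startup, before any PDFs are opened.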
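For the page-at-a-time idea, here is roughly what I have in mind. It leans on ImageMagick's file.pdf[N] page-selection syntax, which I assume wand passes through in filename, and on PyPDF2 purely to count the pages cheaply; both are assumptions on my part and untested:

import io
import tempfile

import pytesseract
from PIL import Image as PI
from PyPDF2 import PdfFileReader
from wand.image import Image

def pdf_to_text_paged(read_pdf_file):
    # Count the pages without rasterizing anything.
    num_pages = PdfFileReader(io.BytesIO(read_pdf_file)).getNumPages()
    final_text = []
    with tempfile.NamedTemporaryFile(suffix='.pdf') as f:
        f.write(read_pdf_file)
        f.flush()
        for page in range(num_pages):
            # 'file.pdf[N]' should make ImageMagick rasterize only page N,
            # so only one page's pixels are in RAM at a time.
            with Image(filename='%s[%d]' % (f.name, page), resolution=300) as img:
                png_blob = img.make_blob('png')
            txt = pytesseract.image_to_string(
                PI.open(io.BytesIO(png_blob)).convert('RGB'))
            final_text.append(txt)
    return " ".join(final_text)

Writing the upload to a temporary file is only so the [N] syntax has a filename to point at, since my blob comes from a Django upload.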
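And for calling ImageMagick directly, I assume the simplest route is shelling out to the convert CLI with its -limit options from the massive-files page; something like the following, where the paths and function name are just illustrative:

import subprocess

def rasterize_page(pdf_path, page, out_png):
    # -limit memory/map apply the caps from the ImageMagick docs;
    # -density 300 matches the resolution used above.
    subprocess.check_call([
        'convert',
        '-limit', 'memory', '256MiB',
        '-limit', 'map', '512MiB',
        '-density', '300',
        '%s[%d]' % (pdf_path, page),
        out_png,
    ])

Is one of these approaches the right way to keep peak memory down, or is there a better option I am missing?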