I am trying to convert a PDF to an image so I can OCR it, but the quality is degraded during the conversion.

There seem to be two main methods for converting a PDF to an image (JPG/PNG) with Python - pdf2image and ImageMagick/Wand.

# pdf2image (changing dpi to 300, 600, etc. does not seem to make a difference):
from pdf2image import convert_from_path

pages = convert_from_path("page.pdf", dpi=300)
for page in pages:
    page.save("page.jpg", "JPEG")

# ImageMagick (Wand library)
from wand.image import Image

with Image(filename="page.pdf", resolution=300) as img:
    img.compression_quality = 100
    img.save(filename="page.jpg")

But if I simply take a screenshot of the PDF on a Mac, the quality is higher than using either Python conversion method.

A good way to see this is to run Tesseract OCR on the resulting images - both Python methods give average results, whereas the screenshot gives perfect results. (I've tried both PNG and JPG.)

Assume I have infinite time, computing power and storage space. I am only interested in image quality and OCR output. It's frustrating to have the perfect image just within reach, but not be able to generate it in code.

What is going on here? Is there a better way to convert a PDF? Is there a way I can get more direct control? Why is a screenshot doing such a better job than an actual conversion?

Christoph Rackwitz
samiles
  • This could be a bad idea. Normally the text is visible in the PDF, so you should be able to extract it directly. And if the text is in images, then extract the images and process them directly instead of adding another layer of artifacts. – xenoid Mar 01 '22 at 07:16
  • Please show that screenshot from a Mac next to the other two conversions. Likely the Mac takes the screenshot at a higher DPI than you specified in the other two methods, so you should consider choosing better DPI values. – Christoph Rackwitz Mar 01 '22 at 12:47
  • Same problem as you with convert_from_path(): I call the method with dpi=350, but the saved image only has a dpi of... 24? Have you found a better way by now? – ptrckdev Dec 11 '22 at 13:41
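
To make the DPI point from the comments concrete: PDF page sizes are expressed in points (1/72 inch), so the pixel dimensions of a rendered page follow directly from the requested DPI. The sketch below is plain arithmetic and not part of either library; `rendered_size` is a hypothetical helper, and the US Letter page size (612 x 792 points) is only an assumed example. Comparing its result with the actual pixel size of the saved image, rather than with the DPI tag stored in the file metadata, is one way to check whether the `dpi` argument took effect.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Expected pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter (612 x 792 pt) is an assumed example page size
for dpi in (72, 150, 300, 600):
    print(dpi, rendered_size(612, 792, dpi))
```

If the saved image is much smaller than the size computed here, the renderer probably did not use the requested resolution, whatever the metadata says.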

1 Answer

You can use PyMuPDF and set the dpi you want:

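import fitz  # PyMuPDF

doc = fitz.open('some/pdf/path')
page = doc.load_page(0)  # first page (0-based index)
pixmap = page.get_pixmap(dpi=300)  # render the page at 300 dpi
img = pixmap.tobytes()  # encoded image bytes (PNG by default)
# Continue with whatever logic...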
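
One possible way to continue from here, as a sketch rather than a definitive recipe: render through an explicit zoom matrix and save as PNG, which is lossless and so avoids adding JPEG artifacts before OCR. `zoom_for_dpi` and `render_page_png` are hypothetical helper names, and the file paths in the usage comment are placeholders. The `fitz` import is deferred into the function so the zoom arithmetic can be used without PyMuPDF installed.

```python
def zoom_for_dpi(dpi: float) -> float:
    """Scale factor relative to the PDF's native 72 points per inch."""
    return dpi / 72.0


def render_page_png(pdf_path: str, out_path: str,
                    page_number: int = 0, dpi: int = 300) -> None:
    """Render one page of a PDF to a PNG file at the requested DPI."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_number)
        # Matrix(zoom, zoom) scales both axes; alpha=False gives an opaque image
        pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        pixmap.save(out_path)  # format is inferred from the .png extension
    finally:
        doc.close()


# Example call with placeholder paths (assumed for illustration):
# render_page_png("page.pdf", "page.png", dpi=300)
```

Since the question puts no limit on time or storage, trying a few DPI values (for example 300, 450 and 600) and comparing the OCR output is a reasonable way to pick one.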
Jboulery