When I use the pdf2image Python package and pass a multi-page PDF to convert_from_bytes() or convert_from_path(), the returned list does contain multiple images, but every image shows the last PDF page (I would have expected each image to show a different page).

The output looks something like this:

[Image: pdf2image conversion bug]

Any idea why this would occur? I can't find a solution online. I found a vague suggestion that the use_cropbox argument might help, but changing it has no effect.

def convert(self, opened_file):
    # Read the PDF and convert its pages to PPM image objects
    try:
        _ppm_pages = self.pdf2image.convert_from_bytes(
            opened_file.read(),
            grayscale=True,
        )
    except Exception as e:
        print(f"[CreateJPEG] Could not convert PDF pages to JPEG image due to error: \n    '{e}'")
        return

    # Do stuff with _ppm_pages
    for img in _ppm_pages:
        img.show()  # ...all images in that list are of the last page

Sometimes the output is an empty 1x1 image instead, and I haven't found a reason for that either. If you have any idea what causes it, please let me know!

Thanks in advance, Simon

EDIT: Added code.

EDIT: So, when I try this in a random notebook, it actually works fine.

I've removed a few detours I used in my original code, and now it works. Still not sure what the underlying reason was though...

All the same, thanks for your help, everyone!

2 Answers

I'm using this right now....

from pdf2image import convert_from_path

imgSet = convert_from_path(pathToPDF, 500)

That gives me a list of images in imgSet (the second argument, 500, is the DPI).
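One way to check whether the images in such a list really differ, rather than eyeballing them, is to hash each page's pixel data. This is only a minimal sketch, assuming the list holds PIL images (which expose .size and .tobytes()); the helper names page_digests and all_pages_distinct are made up for illustration and are not part of pdf2image.

```python
import hashlib


def page_digests(images):
    """Return one SHA-256 hex digest per image, built from its size and raw pixels.

    Assumes each item behaves like a PIL image, i.e. has a `.size` tuple
    and a `.tobytes()` method. (Hypothetical helper, not a pdf2image API.)
    """
    digests = []
    for img in images:
        h = hashlib.sha256()
        h.update(repr(img.size).encode())  # include dimensions in the hash
        h.update(img.tobytes())            # raw pixel data
        digests.append(h.hexdigest())
    return digests


def all_pages_distinct(images):
    """True if no two images in the list hash to the same digest."""
    digests = page_digests(images)
    return len(set(digests)) == len(digests)
```

Calling all_pages_distinct(imgSet) on the list above would return False if every entry were a copy of the last page. Note that it would also return False for a PDF that genuinely contains two identical pages (two blank pages, for example), so treat it as a quick diagnostic only.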

Amiga500
  • Right, I'm getting a set as well, but all images in that set are the same image - the last PDF page. If you look at all images in your imgSet, does each image represent a distinct page, or are they all images of the last page in the set? – Simon Mortensen Mar 10 '22 at 10:43
  • They are unique pages... (now off to quadruple check just to be sure to be sure - but 99.9999% sure they were different) – Amiga500 Mar 10 '22 at 11:02
  • Should be... When I try to do the same thing in a random notebook, it seems to work just fine. Very confused. – Simon Mortensen Mar 10 '22 at 11:06
  • Yeah, confirmed as definitely different. Bit of a puzzle all right. I wonder if pdf2image is using some underlying PDF property that is messed up? But it's strange that it then gives you the right number of pages and just doesn't iterate through them. – Amiga500 Mar 10 '22 at 11:08
I guess you have to do something like this as described in the unit tests of the package.

        with open("./tests/test.pdf", "rb") as pdf_file:
            images_from_bytes = convert_from_bytes(pdf_file.read(), fmt="jpg")
            self.assertTrue(images_from_bytes[0].format == "JPEG")
balu
  • Right, just tried that at your suggestion. Unfortunately, it had no impact on the images output. All images are still of the last PDF page – Simon Mortensen Mar 10 '22 at 10:49