pdf2image conversion of multi page PDFs to images returns the last page on all images

Question

When I use the pdf2image Python package and pass a multi-page PDF to convert_from_bytes() or convert_from_path(), the returned list does contain multiple images, but every image shows the last PDF page (I expected each image to represent one of the PDF pages).


Any idea on why this would occur? I can't find any solution to this online. I've found some vague suggestion that the use_cropbox argument might be used, but modifying it has no effect.

def convert(self, opened_file):
    # Read the PDF and convert its pages to PPM image objects
    try:
        _ppm_pages = self.pdf2image.convert_from_bytes(
            opened_file.read(),
            grayscale=True
        )
    except Exception as e:
        print(f"[CreateJPEG] Could not convert PDF pages to JPEG image due to error: \n    '{e}'")
        return

    # Do stuff with _ppm_pages
    for img in _ppm_pages:
        img.show()  # ...all images in that list are of the last page

Sometimes the output is an empty 1x1 image instead, and I haven't found a reason for that either. If you have any idea what causes it, please let me know!
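To check whether the returned images really are identical (rather than just looking alike in a viewer), one option is to hash each page's pixel data and compare sizes. This is only a diagnostic sketch: the helper names summarize_pages and count_distinct are made up for illustration, and it assumes the pages are PIL images (which expose .size and .tobytes()), as pdf2image returns.

```python
import hashlib


def summarize_pages(pages):
    """Return a (size, sha256-of-pixels) tuple for every page image.

    `pages` is any sequence of objects exposing `.size` and `.tobytes()`,
    as PIL.Image.Image objects do.
    """
    summary = []
    for page in pages:
        digest = hashlib.sha256(page.tobytes()).hexdigest()
        summary.append((page.size, digest))
    return summary


def count_distinct(pages):
    """Count pages that differ from each other in size or pixel data."""
    return len(set(summarize_pages(pages)))
```

If count_distinct(_ppm_pages) equals len(_ppm_pages), the pages differ and the problem is probably in later processing; a size of (1, 1) in the summary would flag the empty images.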

Thanks in advance, Simon

EDIT: Added code.

EDIT: So, when I try this in a random notebook, it actually works fine.

I've removed a few detours I used in my original code, and now it works. Still not sure what the underlying reason was though...

All the same, thanks for your help, everyone!

Has anyone found a solution? – Arif Rasim May 26 '23 at 10:38

score 0 · Answer 1 · answered Mar 10 '22 at 10:16

I'm using this right now:

from pdf2image import convert_from_path

imgSet = convert_from_path(pathToPDF, 500)  # second argument is the DPI

That gives me a list of images in imgSet.

Amiga500


Right, I'm getting a set as well, but all images in that set are the same image - the last PDF page. If you look at all images in your imgSet, does each image represent a distinct page, or are they all images of the last page in the set? – Simon Mortensen Mar 10 '22 at 10:43
They are unique pages... (now off to quadruple check just to be sure to be sure - but 99.9999% sure they were different) – Amiga500 Mar 10 '22 at 11:02
Should be... When I try to do the same thing in a random notebook, it seems to work just fine. Very confused. – Simon Mortensen Mar 10 '22 at 11:06

Yeah, confirm as definitely different. Bit of a puzzle all right. I wonder is pdf2image using some underlying pdf property that is messed up? But strange that it then gives you the right number of pages, just doesn't iterate through them. – Amiga500 Mar 10 '22 at 11:08
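One way to narrow this down might be to convert a single page per call using the first_page and last_page arguments of convert_from_path, so each image comes from its own conversion. This is a sketch, not a known fix: convert_page_by_page is a hypothetical helper, page_count is assumed to be known in advance, and the convert parameter exists only so the function can be exercised without pdf2image installed.

```python
def convert_page_by_page(pdf_path, page_count, convert=None):
    """Convert pages 1..page_count one call at a time and return them in order."""
    if convert is None:
        # Imported lazily so the helper can be tested with a stand-in converter.
        from pdf2image import convert_from_path as convert

    pages = []
    for number in range(1, page_count + 1):
        # pdf2image page numbers are 1-based; each call returns a list of images.
        pages.extend(convert(pdf_path, first_page=number, last_page=number))
    return pages
```

If the images from this loop are distinct while a single multi-page call returns duplicates, that would point at how the results are handled afterwards rather than at the PDF itself.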

score 0 · Answer 2 · answered Mar 10 '22 at 10:31

I guess you have to do something like this as described in the unit tests of the package.

        with open("./tests/test.pdf", "rb") as pdf_file:
            images_from_bytes = convert_from_bytes(pdf_file.read(), fmt="jpg")
            self.assertTrue(images_from_bytes[0].format == "JPEG")

balu


Right, just tried that at your suggestion. Unfortunately, it had no impact on the images output. All images are still of the last PDF page – Simon Mortensen Mar 10 '22 at 10:49
