How to extract any image with python PDF extraction?

Question

I created a PDF extraction program using Tkinter, PyPDF2, and PIL by following a tutorial. This is the image extraction code:

def extract_images(page):
    images = []
    if '/XObject' in page['/Resources']:
        xObject = page['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                mode = ""
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "CMYK"
                img = Image.frombytes(mode, size, data)
                images.append(img)
    else:
        img = Image.new("RGB", (100, 100), (255, 255, 255))
        images.append(img)
        
    return images

It worked with the provided test files, but not with any other PDF. Other files usually fail with this error:

raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /DCTDecode

I've tried changing the code, but I cannot find a solution.

score 0 · Answer 1 · answered Jun 18 '23 at 11:26

It became way easier with pypdf >= 3.10.0.

From the docs:

from pypdf import PdfReader

reader = PdfReader("example.pdf")

page = reader.pages[0]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1

As a side-note: PyPDF2 is deprecated. Use pypdf.
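
The docs snippet above handles only the first page. Here is a rough sketch of how it could be extended to every page, assuming pypdf >= 3.10.0 is installed. The names `save_page_images` and `extract_from_pdf` and the `images` output folder are made up for illustration and are not part of pypdf. Only `PdfReader`, `reader.pages`, `page.images`, and the `.name` / `.data` attributes come from the library.

```python
from pathlib import Path


def save_page_images(pages, out_dir):
    """Write every image found on the given pages into out_dir.

    `pages` is any iterable whose items expose `.images`, as pypdf's
    `reader.pages` does; each image needs `.name` and `.data`.
    (Hypothetical helper, not a pypdf function.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_number, page in enumerate(pages, start=1):
        for index, image in enumerate(page.images):
            # Prefix with page number and index so that images with the
            # same name on different pages do not overwrite each other.
            target = out_dir / f"page{page_number}_{index}_{image.name}"
            target.write_bytes(image.data)
            written.append(target)
    return written


def extract_from_pdf(pdf_path, out_dir="images"):
    """Open a PDF with pypdf and save all of its images (hypothetical helper)."""
    # Imported lazily so save_page_images can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return save_page_images(reader.pages, out_dir)
```

Calling `extract_from_pdf("example.pdf")` should return the list of paths that were written. Since pypdf decodes the image streams itself, there should be no need to map `/ColorSpace` to a PIL mode by hand as in the question.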
