Extracting images from a PDF using PyPDF2 - but the pdf has no metadata

Question

The PDF is a scanned document, and I have not yet found a way to pull out the individual images. I tried setting the media box and crop box, hoping that would grab only the region I specified, but it still pulls the entire page as one image. I also tried other parsing libraries such as pdfminer.six, with the same result: the entire page is extracted.

score 0 · Answer 1 · answered Mar 19 '23 at 23:35

If the document is scanned, each whole page is a single image, so every library will give you exactly that.

As the maintainer of pypdf and PyPDF2, I can tell you that there is no way around that.
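To check whether a file behaves this way, a minimal sketch along the following lines may help. It assumes a reasonably recent pypdf (or PyPDF2 3.x, with the import changed accordingly) where `page.images` is available. The helper names `image_counts` and `looks_scanned` are hypothetical, and the "exactly one image per page" rule is only a heuristic, not something the library reports.

```python
def image_counts(pdf_path):
    """Return the number of embedded images on each page of the PDF."""
    # Third-party dependency, imported lazily: pip install pypdf
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [len(page.images) for page in reader.pages]


def looks_scanned(counts):
    """Heuristic (an assumption, not a pypdf feature):
    treat the file as a scan if every page holds exactly one image."""
    return bool(counts) and all(n == 1 for n in counts)
```

For example, `looks_scanned(image_counts("scan.pdf"))` (with `scan.pdf` as a placeholder file name) would likely return `True` for a typical scanned document, which matches the behavior described above: the only image on each page is the page itself.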

If you want the illustrations within that page image, you need machine learning. Or use an image cropping tool.
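The cropping route can be sketched as follows, assuming the page has already been saved as an image file and that you know roughly where the illustration sits. The sketch uses Pillow as a stand-in for "an image cropping tool". The function names and the idea of giving the region as fractions of the page size are assumptions made for illustration.

```python
def fractional_box_to_pixels(width, height, box):
    """Convert a (left, top, right, bottom) box given as fractions
    of the page size (0.0 to 1.0) into integer pixel coordinates."""
    left, top, right, bottom = box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def crop_illustration(page_image_path, box, out_path):
    """Cut the region described by `box` out of a page image and save it."""
    # Third-party dependency, imported lazily: pip install Pillow
    from PIL import Image

    with Image.open(page_image_path) as page:
        pixel_box = fractional_box_to_pixels(page.width, page.height, box)
        page.crop(pixel_box).save(out_path)
```

A call such as `crop_illustration("page1.png", (0.1, 0.25, 0.5, 0.75), "figure1.png")` (placeholder file names) would save the chosen rectangle as its own file. Finding the rectangle automatically is the part that needs machine learning or some other detection step.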
