I've been trying for about a week to automate image extraction from a PDF. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using PyPDF2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in PyPDF2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
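For reference, a minimal sketch of the kind of recursive walk described above, written so that a missing /XObject entry yields nothing instead of raising KeyError. The helper names resolve and find_images are hypothetical. The code only assumes dictionary-like objects with an optional get_object() or getObject() method for indirect references, so plain dicts stand in for parsed PDF objects here. It also descends into Form XObjects, which can carry their own /Resources.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers one; plain values pass through."""
    getter = getattr(obj, "get_object", None) or getattr(obj, "getObject", None)
    return getter() if callable(getter) else obj


def find_images(resources, seen=None):
    """Yield (name, xobject) for each image XObject reachable from a /Resources dict."""
    seen = set() if seen is None else seen
    resources = resolve(resources)
    if not hasattr(resources, "get"):
        return  # no resources at all
    xobjects = resolve(resources.get("/XObject"))  # .get() avoids the KeyError
    if not hasattr(xobjects, "items"):
        return  # this resource dictionary has no /XObject entry
    for name, ref in xobjects.items():
        xobj = resolve(ref)
        if id(xobj) in seen:  # best-effort guard against reference cycles
            continue
        seen.add(id(xobj))
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            yield name, xobj
        elif subtype == "/Form":
            # Form XObjects may nest further images in their own resources.
            yield from find_images(xobj.get("/Resources"), seen)


# Stand-in for a parsed page: one direct image and one nested inside a form.
page_resources = {
    "/XObject": {
        "/Im0": {"/Subtype": "/Image"},
        "/Fm0": {
            "/Subtype": "/Form",
            "/Resources": {"/XObject": {"/Im1": {"/Subtype": "/Image"}}},
        },
    }
}
image_names = [name for name, _ in find_images(page_resources)]
```

If this yields nothing for every page, the images are probably not stored as named XObjects at all.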
Using PyPDF2, I wrote a single page of the PDF to a new file and opened it in Notepad++, where I found some streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get one stream (no clue how to get the others).
Using zlib, I decompressed it and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
From what I could find, BDC has something to do with Ghostscript.
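For context: in the PDF specification, BDC, BMC and EMC are marked-content operators, and a page's content stream is a list of drawing operators with numeric operands, not image data. Images are normally referenced from it either as a named XObject painted with Do, or as an inline image opened with BI. A rough sketch of scanning a decompressed content stream for those operators follows. The function name and regexes are my own, and because this is a regex heuristic rather than a real tokenizer, it can be fooled by string literals or binary inline-image data.

```python
import re

# "/Name Do" paints a named XObject (possibly an image).
DO_RE = re.compile(rb"(/[^\s/<>\[\]()]+)\s+Do\b")
# "BI" as a standalone token opens an inline image dictionary.
BI_RE = re.compile(rb"(?:^|(?<=\s))BI(?=\s)")
# Marked-content operators only tag structure; they carry no pixels.
MARKED_RE = re.compile(rb"\b(BDC|BMC|EMC)\b")


def summarize_content_stream(data: bytes) -> dict:
    """Heuristically list the image-related operators in a content stream."""
    return {
        "xobjects_painted": [m.group(1).decode("latin-1") for m in DO_RE.finditer(data)],
        "inline_images": len(BI_RE.findall(data)),
        "marked_content_ops": [op.decode("ascii") for op in MARKED_RE.findall(data)],
    }


# Hand-written sample resembling the decompressed output described above.
sample = (
    b"/Part <</MCID 0 >>BDC\n"
    b"q 100 0 0 50 72 700 cm\n"
    b"/Im0 Do\n"
    b"Q\n"
    b"EMC\n"
    b"BI /W 1 /H 1 /BPC 8 /CS /G ID \x00 EI\n"
)
summary = summarize_content_stream(sample)
```

A name reported under xobjects_painted would then have to be looked up in the page's /Resources /XObject dictionary to reach the actual image stream.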
At this point I gave up and decided to ask for help.
Is there a Python tool to, at least, extract all streams (and identify the FlateDecode tag)?
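To make concrete what I mean by extracting all streams, here is a minimal standard-library sketch that scans the raw file bytes for stream ... endstream pairs and inflates the ones whose dictionary mentions /FlateDecode. The names iter_streams and PdfStream are hypothetical. It assumes unencrypted streams that are not stored inside compressed object streams, and it ignores /Length, predictors and filter chains, so a real parser would be more reliable.

```python
import re
import zlib
from typing import Iterator, NamedTuple, Optional

# "stream" keyword (not the tail of "endstream"), EOL, data, EOL, "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


class PdfStream(NamedTuple):
    header: bytes           # raw text between "obj" and "stream"
    is_flate: bool          # True if the header mentions /FlateDecode
    data: Optional[bytes]   # decoded bytes, or None if inflating failed


def iter_streams(pdf_bytes: bytes) -> Iterator[PdfStream]:
    """Yield every stream found by a plain byte scan of the file."""
    for match in STREAM_RE.finditer(pdf_bytes):
        # The stream dictionary sits between the nearest preceding "obj" and "stream".
        obj_start = pdf_bytes.rfind(b"obj", 0, match.start())
        header = pdf_bytes[obj_start:match.start()] if obj_start != -1 else b""
        raw = match.group(1)
        is_flate = b"/FlateDecode" in header
        data: Optional[bytes] = raw
        if is_flate:
            try:
                # decompressobj tolerates stray trailing bytes after the zlib data.
                data = zlib.decompressobj().decompress(raw)
            except zlib.error:
                data = None  # damaged data or a filter chain this sketch ignores
        yield PdfStream(header, is_flate, data)


# Tiny synthetic file body: one compressed and one uncompressed stream.
payload = b"/Part <</MCID 0 >>BDC"
compressed = zlib.compress(payload)
demo_pdf = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Length " + str(len(compressed)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n" + compressed + b"\nendstream\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)
streams = list(iter_streams(demo_pdf))
```

On a real file this would be called as iter_streams(open(path, "rb").read()), with the header of each result showing which other filters and keys are present.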
And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?
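On identifying what a stream holds: as far as I understand the PDF specification, only some image filters leave a recognizable file signature. A /DCTDecode image stream is a complete JPEG and a /JPXDecode stream is JPEG 2000, whereas a /FlateDecode image is just raw pixel samples that only make sense together with /Width, /Height, /ColorSpace and /BitsPerComponent from the stream dictionary. A small sketch of that check, with guess_image_kind as a hypothetical helper and the returned labels chosen arbitrarily:

```python
JPEG_SOI = b"\xff\xd8"                            # JPEG start-of-image marker
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"  # JPEG 2000 file signature box
J2K_CODESTREAM = b"\xff\x4f\xff\x51"              # bare JPEG 2000 codestream


def guess_image_kind(data: bytes, filter_name: str = "") -> str:
    """Guess what an image stream contains from its bytes and /Filter name."""
    if data.startswith(JPEG_SOI) or filter_name == "/DCTDecode":
        return "jpeg"  # can be written to a .jpg file as-is
    if data.startswith((JP2_SIGNATURE, J2K_CODESTREAM)) or filter_name == "/JPXDecode":
        return "jpeg2000"
    if filter_name in ("/CCITTFaxDecode", "/JBIG2Decode"):
        return "bilevel-compressed"  # needs a dedicated decoder
    if filter_name == "/FlateDecode":
        # No magic number: interpret with /Width, /Height, /ColorSpace, /BitsPerComponent.
        return "raw-samples"
    return "unknown"


kinds = [
    guess_image_kind(b"\xff\xd8\xff\xe0rest-of-file"),
    guess_image_kind(b"\x00\x01\x02\x03", "/FlateDecode"),
    guess_image_kind(b"/Part <</MCID 0 >>BDC"),
]
```

The last sample is the content-stream text quoted earlier, which is why it comes back as unknown: it is drawing instructions, not an image.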
I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any XObjects when opening the PDF in Notepad++, or when running the various Python scripts used to parse PDFs. I managed to find what I suspect are the images, with no XObject tags, but with what seems like a stream tag, though the information is not compressed.