Finding and identifying streams in PDF using python

Question

I've been trying for about a week to automate image extraction from a pdf. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using pypdf2, all with ['/XObject'] in them, which results in a KeyError.

What I'm looking for seems to be hiding in streams, which I can't find in pypdf2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).

Using pypdf2 I've written a single page of the PDF out to its own file and opened it in Notepad++, where I found some streams with the /FlateDecode filter.

pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get a stream (no clue how to get the others).

Using zlib, I decompressed it, and got something starting with:

/Part <</MCID 0 >>BDC

(It also contains a lot of floating-point numbers, both positive and negative)

From what I could find, BDC has something to do with ghostscript.

At this point I gave up and decided to ask for help.

Is there a Python tool that can, at least, extract all streams (and identify the FlateDecode filter)?

And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?

I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.

Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any xObjects when opening the PDF in Notepad++, or when running the various python scripts used to parse PDFs. I managed to find what I suspect are the images, with no xObject tags, but with what seems like a stream tag - though the information is not compressed.

There are so many tools generating PDF files (many of them borderline defective) that it is hard to give advice without seeing a sample of the particular document that is giving you trouble. Is there some sample you can share? — Paulo Scardine, Aug 07 '17 at 12:49
@PauloScardine sorry, I realize my wording implies I'm looking for help on a specific PDF. The PDF document is properly displayed in a reader, and I can extract a page out of it with no problem. Just can't find any reference for the content of the streams (Or any python tool that can easily do it for me. That'd be nice too). Anyway, Looking for something that'll work on any PDF doc that is displayed properly. — user1999728, Aug 07 '17 at 17:46
You *got something starting with: `/Part <</MCID 0 >>BDC`* and *at this point gave up*? Why? You successfully arrived in a PDF content stream. You merely would have had to take the pdf specification ISO 32000-1 to interpret the stream content. — mkl, Aug 07 '17 at 18:24
@mkl because my searches didn't turn up the ISO 32000-1 thing. I'll look it up. Thank you! — user1999728, Aug 07 '17 at 18:30
As a hint: Since 2008 Adobe has been offering a freely downloadable version at http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf — mkl, Aug 07 '17 at 18:47
@user1999728 I asked for a sample because I don't have a document like yours, and without one I'm unable to reproduce your problem. As you may already know, one will receive more answers here if volunteers receive a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). I do use pypdf2 to extract images from PDFs (more than a million documents/year) and I use several forks that implement codecs lacking on pypdf2. — Paulo Scardine, Aug 07 '17 at 21:24

score 0 · Answer 1 · answered Aug 08 '17 at 15:11

Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is a stream of type XObject and subtype Image, which is usually found in a page's Resources -> XObject dictionary (see sections 7.3.8, 7.8.3, and 8.9.5 of the PDF Reference indicated by @mkl).
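To connect this with the decompressed content stream from the question: the content stream does not hold the image data itself, but it does name the XObjects it paints, using the `Do` operator (for example `/Im0 Do`). The following is only a rough sketch using the standard library. The function name `xobject_names_used` and the sample bytes are made up for illustration, and a regular expression is not a real content stream parser (it can be fooled by string literals, for instance).

```python
import re
import zlib

# Matches a PDF name immediately followed by the "Do" operator, e.g. b"/Im0 Do".
_DO_PATTERN = re.compile(rb"(/[^\s/\[\]<>()]+)\s+Do\b")


def xobject_names_used(raw_stream, compressed=True):
    """Return the XObject names painted by a content stream.

    raw_stream: the bytes of the content stream as stored in the file.
    compressed: True if the stream uses the /FlateDecode filter (zlib).
    """
    data = zlib.decompress(raw_stream) if compressed else raw_stream
    # Names are ASCII in practice; latin-1 never fails on arbitrary bytes.
    return [m.group(1).decode("latin-1") for m in _DO_PATTERN.finditer(data)]


if __name__ == "__main__":
    # Invented sample, modelled on the stream start shown in the question.
    sample = b"/Part <</MCID 0 >>BDC\nq 100 0 0 50 72 700 cm /Im0 Do Q\nEMC\n"
    print(xobject_names_used(zlib.compress(sample)))
```

Each name returned this way is a key to look up in the Resources -> XObject dictionary. If the list is empty and the stream contains `BI` ... `ID` ... `EI` sequences instead, the page is probably using inline images.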

Alternatively, Image XObjects can also be found inside Form XObjects (subtype Form, which means they have their own content stream), in the form's own Resources -> XObject dictionary, so the search for Image XObjects may need to be recursive.

An Image XObject can also have a soft mask (its SMask entry), which is itself an Image XObject. Form XObjects are also used in tiling patterns, and so could conceivably contain Image XObjects (though that isn't common either), and in an annotation's normal appearance (though Image XObjects are less commonly used within annotations, except perhaps 3D or multimedia annotations).
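The recursive search described above could be sketched roughly as follows. This is an illustration, not any library's API. The names `resolve` and `iter_image_xobjects` are hypothetical, and the code assumes only that the PDF library exposes dictionaries that behave like Python dicts, with indirect references offering a `get_object()` method (pypdf-style). It covers page resources, Form XObjects, and soft masks, but not tiling patterns or annotation appearances.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object().

    Assumption: pypdf-style objects. Plain Python values pass through.
    """
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_image_xobjects(resources, max_depth=16):
    """Yield (name, xobject) for Image XObjects reachable from a
    resource dictionary, descending into Form XObjects."""
    resources = resolve(resources)
    if max_depth < 0 or not resources or "/XObject" not in resources:
        return  # no XObject entry here: nothing to yield, and no KeyError
    xobjects = resolve(resources["/XObject"])
    for name in xobjects:
        xobj = resolve(xobjects[name])
        subtype = resolve(xobj.get("/Subtype"))
        if subtype == "/Image":
            yield name, xobj
            if "/SMask" in xobj:
                # The soft mask is itself an Image XObject.
                yield name + "/SMask", resolve(xobj["/SMask"])
        elif subtype == "/Form" and "/Resources" in xobj:
            # A form has its own content stream and its own resources.
            yield from iter_image_xobjects(xobj["/Resources"], max_depth - 1)
```

The function would be called with a page's Resources dictionary. Keep in mind that Resources is an inheritable entry, so it may sit on a parent node of the page tree instead of on the page itself, which is one way a direct `['/XObject']` lookup can fail. The `max_depth` limit is only a guard against forms that reference each other.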
