I am using pdfrw and one of its examples to extract the only image in a PDF file and save that image to a PNG or JPEG file.

The code is too challenging for me to understand. What parameters should I pass to find_objects?

from pdfrw.objects import PdfDict, PdfArray, PdfName
from pdfrw.pdfwriter import user_fmt


def find_objects(source, valid_types=(PdfName.XObject, None),
                 valid_subtypes=(PdfName.Form, PdfName.Image),
                 no_follow=(PdfName.Parent,),
                 isinstance=isinstance, id=id, sorted=sorted,
                 reversed=reversed, PdfDict=PdfDict):
    '''
        Find all the objects of a particular kind in a document
        or array.  Defaults to looking for Form and Image XObjects.
        This could be done recursively, but some PDFs
        are quite deeply nested, so we do it without
        recursion.
        Note that we don't know exactly where things appear on pages,
        but we aim for a sort order that is (a) mostly in document order,
        and (b) reproducible.  For arrays, objects are processed in
        array order, and for dicts, they are processed in key order.
    '''
    container = (PdfDict, PdfArray)

    # Allow passing a list of pages, or a dict
    if isinstance(source, PdfDict):
        source = [source]
    else:
        source = list(source)

    visited = set()
    source.reverse()
    while source:
        obj = source.pop()
        if not isinstance(obj, container):
            continue
        myid = id(obj)
        if myid in visited:
            continue
        visited.add(myid)
        if isinstance(obj, PdfDict):
            if obj.Type in valid_types and obj.Subtype in valid_subtypes:
                yield obj
            obj = [y for (x, y) in sorted(obj.iteritems())
                   if x not in no_follow]
        else:
            # TODO: This forces resolution of any indirect objects in
            # the array.  It may not be necessary.  Don't know if
            # reversed() does any voodoo underneath the hood.
            # It's cheap enough for now, but might be removeable.
            obj and obj[0]
        source.extend(reversed(obj))


find_objects('target.pdf')
Nyxynyx
  • Do you mean that this code (which is the exact example, right?) does not work? How does it fail? – Jongware Aug 12 '16 at 20:51
  • @RadLexus It does not return anything... I was hoping it would return something related to the image in the PDF. – Nyxynyx Aug 14 '16 at 22:33

1 Answer

I am the pdfrw author, and I haven't written code yet to do that :(.

Typically if I need to do that, I use inkscape. It works great in a command line mode.

pdfrw may be useful as part of the reverse path. img2pdf.py is an awesome tool that will put PDF images on a page, and pdfrw can add those images (once they are in a PDF) to other pages.

Edited to add:

pdfrw is actually useful for extracting images, in that it can place all the images into a new PDF, one image per page. See extract.py in the examples directory.

It cannot (yet???) then extract the images to a JPEG, but that is an easy task with inkscape, which will even let you crop to the actual image size quite easily.
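On the original question of what to pass to find_objects: the function walks parsed PDF objects, so a filename string gives it nothing to visit. It is also a generator, so it produces results only when iterated. A minimal sketch, assuming find_objects can be imported from pdfrw.findobjs (the module the code in the question appears to come from) and that target.pdf is the file from the question; describe and find_images are hypothetical helper names used only for illustration:

```python
def describe(obj):
    # Subtype is a PDF name such as '/Image'; Width and Height are
    # numeric tokens that int() can convert.
    return (str(obj.Subtype), int(obj.Width), int(obj.Height))


def find_images(pdf_path):
    # pdfrw is a third-party package (pip install pdfrw). It is imported
    # here so describe() above stays usable without it.
    from pdfrw import PdfName, PdfReader
    # Assumption: find_objects lives in pdfrw.findobjs.
    from pdfrw.findobjs import find_objects

    # Pass the parsed page list (or a single PdfDict), not a filename.
    pages = PdfReader(pdf_path).pages
    # The default matches Form and Image XObjects; narrow it to images.
    # list() drains the generator so the results can be reused.
    return list(find_objects(pages, valid_subtypes=(PdfName.Image,)))
```

Calling `[describe(image) for image in find_images('target.pdf')]` should then list one tuple per image found, assuming each image dictionary carries Width and Height entries.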

Patrick Maupin
  • It is possible to find the image objects using find_objects. Is it possible to extract the image binary data using pdfrw? After that it's just a matter of understanding how to decode the image, if needed. – Petri Oct 14 '17 at 11:04
  • Yes, the entries in the image stream dictionary describe the encoding of the stream, and the data is in the stream. Both of these things can easily be read with `pdfrw`. For an example of going the other way -- for taking an image and putting it into a PDF, take a look at [img2pdf](https://gitlab.mister-muffin.de/josch/img2pdf). – Patrick Maupin Oct 15 '17 at 13:39
  • @PatrickMaupin How can you use the library to list all objects and their types? – Iharob Al Asimi Mar 07 '18 at 15:04
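
The exchange above about reading the image dictionary and its stream can be sketched as follows. This is a rough illustration, not a general extractor. It assumes the images are the objects yielded by find_objects, that an image with a single /DCTDecode filter holds a complete JPEG file in its stream, and that pdfrw on Python 3 exposes the stream as a Latin-1 decoded str. jpeg_bytes and save_jpegs are hypothetical helper names.

```python
def jpeg_bytes(image):
    # Only the simple case is handled: a single /DCTDecode filter,
    # where the stream is assumed to already be a JPEG file.
    if str(image.Filter) != '/DCTDecode':
        return None
    data = image.stream
    # Assumption: on Python 3 the stream is a str decoded as Latin-1,
    # so encoding with Latin-1 recovers the raw bytes.
    return data if isinstance(data, bytes) else data.encode('latin-1')


def save_jpegs(images, prefix='image'):
    # Write each JPEG-encoded image to '<prefix>-<index>.jpg'.
    written = []
    for index, image in enumerate(images):
        data = jpeg_bytes(image)
        if data is None:
            # Other filters (for example /FlateDecode) hold raw pixel
            # data and need real decoding, which is not attempted here.
            continue
        name = '%s-%d.jpg' % (prefix, index)
        with open(name, 'wb') as handle:
            handle.write(data)
        written.append(name)
    return written
```

Images stored with other filters, or with unusual color spaces, would need the Width, Height, ColorSpace and BitsPerComponent entries plus an imaging library to rebuild a viewable file.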