Extract images with podofo from pdf pages

Question

I need to extract all images from a PDF file with PoDoFo. Extracting every image in the file works well; I used the image extractor example for that, which fetches all objects in the document and iterates over them. But I need to iterate over the pages and check for image objects on each page. Does anyone know how to do that?

score 1 · Accepted Answer · answered Mar 28 '18 at 17:19

Piggybacking off podofoimgextract, you could iterate over each page, get the page's resources object, check for an XObject or Image entry, and from there it's pretty much the exact same code that is used in the image extraction utility.

for (int pageN = 0; pageN < document.GetPageCount(); pageN++) {
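  PdfPage* page = document.GetPage(pageN);
  // A page may have no usable resources dictionary; skip it in that case.
  PdfObject* resources = page->GetResources();
  if (!resources || !resources->IsDictionary())
    continue;
  // Take a reference instead of copying the whole dictionary.
  PdfDictionary& resource = resources->GetDictionary();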

  for (auto& k : resource.GetKeys()) {
    if (k.first.GetName() == "XObject" || k.first.GetName() == "Image") {
      if (k.second->IsDictionary()) {
        for (auto& r : k.second->GetDictionary().GetKeys()) {
          // The XObject dictionary will usually contain indirect objects as
          // its values. Check for a reference.
          if (r.second->IsReference()) {
            // Get the object that is being referenced.
            auto target =
              document.GetObjects().GetObject(r.second->GetReference());
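            if (target && target->IsDictionary()) {
              auto& targetDict = target->GetDictionary();
              // Skip form XObjects and anything else that is not an image.
              auto subtype = targetDict.GetKey(PdfName::KeySubtype);
              if (!subtype || !subtype->IsName() ||
                  subtype->GetName().GetName() != "Image")
                continue;
              auto kf = targetDict.GetKey(PdfName::KeyFilter);
              if (!kf)
                continue;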
              if (kf->IsArray() && kf->GetArray().GetSize() == 1 &&
                  kf->GetArray()[0].IsName() &&
                  kf->GetArray()[0].GetName().GetName() == "DCTDecode") {
                kf = &kf->GetArray()[0];
              }
              if (kf->IsName() && kf->GetName().GetName() == "DCTDecode") {
                ExtractImage(target, true);
              } else {
                ExtractImage(target, false);
              }
            }
          }
        }
      }
    }
  }
}

While this method may work with PDF documents that have all resources bound to pages, it won't work with documents that have annotations. Such documents have resources that are bound to the document itself, since the appearance of an annotation (which is an XObject) is not stored as a page resource. The current source version of `podofoimgextract` is more accurate in this regard and iterates over all document objects. — ceztko, Apr 11 '18 at 13:31
@ceztko I was not aware of this. Could you explain in more depth where the image object is stored when it's not bound to the page's resources? Thank you. — Cory Mickelson, May 04 '18 at 20:11
You just iterate all document objects with `PdfVecObjects & PdfMemDocument::GetObjects()`. `podofoimgextract` is currently doing this. — ceztko, May 05 '18 at 11:20
Yes of course you can iterate all objects but does this provide some way of knowing which page the image is painted on? If for example I only want images from page 2 is it possible to accomplish this by iterating all document objects? — Cory Mickelson, May 06 '18 at 13:12
Since I was talking about the images/resources of annotations: you can reliably determine which page an annotation is on, but to find all images used in the annotation's appearance you have to parse the XObject stream, since the `Resources` entry in its dictionary is optional (PDF Reference 1.7, section 4.9.1, Form Dictionaries). — ceztko, May 06 '18 at 17:35
