I need to extract all images from a PDF file with PoDoFo. Extracting every image in the file works well; I used the image extractor example for that, which retrieves all objects and iterates over them. However, I need to iterate over the pages and check for image objects on each page. Does anyone know how to do that?
1 Answer
Building on podofoimgextract, you can iterate over each page, get the page's resource object, and check it for an XObject or Image entry. From there, the code is essentially the same as in the image extraction utility.
    for (int pageN = 0; pageN < document.GetPageCount(); pageN++) {
        PdfPage* page = document.GetPage(pageN);
        // Take a reference to the page's resource dictionary instead of copying it.
        PdfDictionary& resource = page->GetResources()->GetDictionary();
        for (auto& k : resource.GetKeys()) {
            if (k.first.GetName() == "XObject" || k.first.GetName() == "Image") {
                if (k.second->IsDictionary()) {
                    for (auto& r : k.second->GetDictionary().GetKeys()) {
                        // The XObject dictionary will usually contain indirect
                        // objects as its values, so check for a reference.
                        if (r.second->IsReference()) {
                            // Get the object that is being referenced.
                            auto target =
                                document.GetObjects().GetObject(r.second->GetReference());
                            // GetObject returns null if the reference cannot be resolved.
                            if (target && target->IsDictionary()) {
                                auto& targetDict = target->GetDictionary();
                                auto kf = targetDict.GetKey(PdfName::KeyFilter);
                                if (!kf)
                                    continue;
                                // A single-element filter array is treated like a plain name.
                                if (kf->IsArray() && kf->GetArray().GetSize() == 1 &&
                                    kf->GetArray()[0].IsName() &&
                                    kf->GetArray()[0].GetName().GetName() == "DCTDecode") {
                                    kf = &kf->GetArray()[0];
                                }
                                // DCTDecode streams are JPEG data; the flag tells
                                // ExtractImage which case it is handling.
                                if (kf->IsName() && kf->GetName().GetName() == "DCTDecode") {
                                    ExtractImage(target, true);
                                } else {
                                    ExtractImage(target, false);
                                }
                            }
                        }
                    }
                }
            }
        }
    }

Cory Mickelson
- While this method may work for PDF documents whose resources are all bound to pages, it won't work for documents with annotations. In such documents, some resources are bound to the document itself, because the appearance of an annotation (which is an XObject) is not stored as a page resource. The current source version of `podofoimgextract` is more accurate in this regard and iterates over all of the document's objects. – ceztko Apr 11 '18 at 13:31
- @ceztko I was not aware of this. Could you explain in more depth where the image object is stored when it's not bound to the page's resources? Thank you. – Cory Mickelson May 04 '18 at 20:11
- You just iterate over all document objects with `PdfVecObjects & PdfMemDocument::GetObjects()`. `podofoimgextract` currently does this. – ceztko May 05 '18 at 11:20
- Yes, of course you can iterate over all objects, but does this provide some way of knowing which page the image is painted on? If, for example, I only want the images from page 2, is it possible to accomplish this by iterating over all document objects? – Cory Mickelson May 06 '18 at 13:12
- Since I was talking about images/resources of annotations: you can reliably determine which page the annotation is on, but to find all images that are used in the annotation's appearance you have to parse the XObject stream, since the `Resources` entry of the form dictionary is optional (PDF Reference 1.7, 4.9.1 Form Dictionaries). – ceztko May 06 '18 at 17:35
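
To illustrate what parsing the stream involves: in a PDF content stream an XObject is painted with the `Do` operator, which takes the resource name as its operand (for example `/Im1 Do`). The following is a minimal sketch, not PoDoFo code, of collecting those names from a stream that has already been decoded into a string. `PaintedXObjectNames` is a hypothetical helper introduced here for illustration, and it assumes whitespace-separated tokens only; a real tokenizer also has to skip string literals, comments, and inline image data.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of PoDoFo): returns the resource names that
// are used as the operand of the `Do` operator in a decoded content stream.
// Assumption: tokens are separated by whitespace only. String literals,
// comments, and inline image data are NOT handled in this sketch.
std::vector<std::string> PaintedXObjectNames(const std::string& content) {
    std::vector<std::string> names;
    std::string previous;  // the token seen just before the current one
    std::size_t i = 0;
    while (i < content.size()) {
        // Skip whitespace between tokens.
        while (i < content.size() &&
               std::isspace(static_cast<unsigned char>(content[i]))) {
            ++i;
        }
        // Read one token up to the next whitespace character.
        const std::size_t start = i;
        while (i < content.size() &&
               !std::isspace(static_cast<unsigned char>(content[i]))) {
            ++i;
        }
        if (start == i) {
            break;  // only trailing whitespace was left
        }
        const std::string token = content.substr(start, i - start);
        // `/Name Do` paints the XObject registered under Name.
        if (token == "Do" && previous.size() > 1 && previous[0] == '/') {
            names.push_back(previous.substr(1));
        }
        previous = token;
    }
    return names;
}
```

Each returned name would then be looked up in the XObject dictionary of whatever resources apply to that stream, and the referenced object checked for being an image, as in the answer above.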