I am trying to get a list of all SDF/COS objects within a PDF document, using PDFNet 7.0.4 and netcoreapp3.1. Using a different PDF parser, I know that this document has 570 total COS objects within it, including 3 images.
Initially I used PDFDoc to load the document, and iterated through the pages just looking for Element objects of type e_image or e_inline_image, but this only yielded 2 out of 3 images. In a larger document it did even worse: 0 out of ~2600 images.
Now I've stepped back and am trying to do a lower-level search via SDFDoc. I can get the trailer object and then iterate through it, recursing into any e_dict or e_stream objects and returning anything that looks like a real object (i.e., anything that actually has an object number and generation).
    IEnumerable<Obj> Recurse(Obj root)
    {
        var idHash = new HashSet<PdfIdentifier>();
        return Recurse(root, idHash);

        static IEnumerable<Obj> Recurse(Obj obj, HashSet<PdfIdentifier> idHash)
        {
            var id = obj.ToPdfIdentifier();
            if (!idHash.Contains(id))
            {
                // Only objects with a real object number/generation are returned.
                if (id != nullIdentifier)
                {
                    idHash.Add(id);
                    yield return obj;
                }

                // Descend into dictionaries and streams only.
                if (obj.GetType().OneOf(Obj.ObjType.e_dict, Obj.ObjType.e_stream))
                {
                    for (var iter = obj.GetDictIterator(); iter.HasNext(); iter.Next())
                    {
                        foreach (var child in Recurse(iter.Value(), idHash))
                        {
                            yield return child;
                        }
                    }
                }
            }
        }
    }

    static PdfIdentifier nullIdentifier = new PdfIdentifier() { Generation = 0, ObjectNum = 0 };
ToPdfIdentifier is a simple extension method to get the object number and generation:

    public static PdfIdentifier ToPdfIdentifier(this pdftron.SDF.Obj obj) =>
        new PdfIdentifier { ObjectNum = obj.GetObjNum(), Generation = obj.GetGenNum() };
This runs OK, but only returns 45 objects, none of them the images I'm actually interested in.
How can I simply get all COS objects from a document?
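One thing visible in the traversal above is that it descends only into dictionaries and streams, so anything reachable solely through an array (a page tree's Kids entry, for example) is never visited. The following is a minimal sketch of that effect using a hypothetical stand-in object model. CosNode, CosKind, CosWalker and Demo are invented for illustration and are not PDFNet types, and whether this accounts for the whole shortfall in a real document is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a COS object; this is NOT the PDFNet Obj type.
public enum CosKind { Dict, Stream, Array, Other }

public sealed class CosNode
{
    public int ObjNum;                    // 0 means a direct (unnumbered) object
    public CosKind Kind = CosKind.Dict;
    public List<CosNode> Children = new List<CosNode>();
}

public static class CosWalker
{
    // Depth-first walk from a root, yielding each numbered object once.
    // followArrays = false mimics a walk that only descends into dicts and streams.
    public static IEnumerable<CosNode> Walk(CosNode root, bool followArrays)
    {
        var seen = new HashSet<int>();
        var stack = new Stack<CosNode>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            if (node.ObjNum != 0)
            {
                // Skip numbered objects we have already yielded (also breaks cycles).
                if (!seen.Add(node.ObjNum)) continue;
                yield return node;
            }
            bool descend = node.Kind == CosKind.Dict
                        || node.Kind == CosKind.Stream
                        || (followArrays && node.Kind == CosKind.Array);
            if (!descend) continue;
            foreach (var child in node.Children) stack.Push(child);
        }
    }
}

public static class Demo
{
    // trailer -> catalog (1) -> pages (2) -> [kids array] -> page (3) -> image stream (4)
    public static CosNode BuildSample()
    {
        var image = new CosNode { ObjNum = 4, Kind = CosKind.Stream };
        var page = new CosNode { ObjNum = 3, Children = { image } };
        var kids = new CosNode { Kind = CosKind.Array, Children = { page } };
        var pages = new CosNode { ObjNum = 2, Children = { kids } };
        var catalog = new CosNode { ObjNum = 1, Children = { pages } };
        return new CosNode { Children = { catalog } }; // the trailer is a direct dict
    }
}
```

On this sample, Walk(root, followArrays: false) stops at the Kids array and never reaches the page or the image, while followArrays: true reaches all four numbered objects. Even with arrays followed, a walk from the trailer can only find objects that are reachable from it, so objects that nothing references would still be missed.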
edit
Here is the original PDFDoc code we tried to get all images:
    private IEnumerable<(PdfIdentifier id, Element el)> GetImages(Stream stream)
    {
        var doc = new PDFDoc(stream);
        var reader = new ElementReader();
        for (var iter = doc.GetPageIterator(); iter.HasNext(); iter.Next())
        {
            reader.Begin(iter.Current());
            var el = reader.Next();
            while (el != null)
            {
                if (el.GetType().OneOf(Element.Type.e_image, Element.Type.e_inline_image))
                {
                    var obj = el.GetXObject();
                    var id = obj.ToPdfIdentifier();
                    yield return (id, el);
                }
                el = reader.Next();
            }
            reader.End();
        }
    }
This kind of worked in that it returned some images, but not all. For some sample documents it returned all, for some it returned a subset, and for some it returned none at all.
edit
Just for future reference, thanks to the answer below from Ryan, we ended up with a pair of nice clean extension methods:
    public static IEnumerable<SDF.Obj> GetAllObj(this SDF.SDFDoc sdfDoc)
    {
        var xrefTableSize = sdfDoc.XRefSize();
        for (int objNum = 0; objNum < xrefTableSize; objNum++)
        {
            var obj = sdfDoc.GetObj(objNum);

            // Skip free (unused) entries in the cross-reference table.
            if (!obj.IsFree())
            {
                yield return obj;
            }
        }
    }
and
    public static string Subtype(this SDF.Obj obj) => obj.FindObj("Subtype") switch
    {
        null => null,
        var s when s.IsName() => s.GetName(),
        var s when s.IsString() => s.GetAsPDFText(),
        _ => throw new Exception("COS object has an invalid Subtype entry")
    };
Now we can get images as simply as:

    sdfDoc.GetAllObj().Where(o => o.IsStream() && o.Subtype() == "Image");

or, using LINQ query syntax:

    from o in sdfDoc.GetAllObj()
    where o.IsStream() && o.Subtype() == "Image"
    select new Image(o);