
I am trying to get a list of all SDF/COS objects within a PDF document, using PDFNet 7.0.4 and netcoreapp3.1. Using a different PDF parser, I know that this document has 570 total COS objects within it, including 3 images.

Initially I used PDFDoc to load the document and iterated through the pages looking for Element objects of type e_image or e_inline_image, but this yielded only 2 of the 3 images. In a larger document it did even worse: 0 out of ~2600 images.

Now, I've stepped back and am trying to do a lower level search via SDFDoc. I can get a trailer object, and then iterate through it, recursing any e_dict or e_stream objects, and returning anything that looks like a real object (i.e., anything that actually has an object number and generation).

IEnumerable<Obj> Recurse(Obj root)
{
    var idHash = new HashSet<PdfIdentifier>();

    return Recurse(root, idHash);

    static IEnumerable<Obj> Recurse(Obj obj, HashSet<PdfIdentifier> idHash)
    {
        var id = obj.ToPdfIdentifier();

        if (!idHash.Contains(id))
        {
            if (id != nullIdentifier)
            {
                idHash.Add(id);
                yield return obj;
            }

            if (obj.GetType().OneOf(Obj.ObjType.e_dict, Obj.ObjType.e_stream))
            {
                for (var iter = obj.GetDictIterator(); iter.HasNext(); iter.Next())
                {
                    foreach (var child in Recurse(iter.Value(), idHash))
                    {
                        yield return child;
                    }
                }
            }
        }
    }
}

static PdfIdentifier nullIdentifier = new PdfIdentifier() { Generation = 0, ObjectNum = 0 };

ToPdfIdentifier is a simple extension method to get the object number and generation:

public static PdfIdentifier ToPdfIdentifier(this pdftron.SDF.Obj obj) => new PdfIdentifier { ObjectNum = obj.GetObjNum(), Generation = obj.GetGenNum() };

This runs OK, but only returns 45 objects, none of them the images I'm actually interested in.

How can I simply get all COS objects from a document?


edit

Here is the original PDFDoc code we tried to get all images:

private IEnumerable<(PdfIdentifier id, Element el)> GetImages(Stream stream)
{
    var doc = new PDFDoc(stream);

    var reader = new ElementReader();

    for (var iter = doc.GetPageIterator(); iter.HasNext(); iter.Next())
    {
        reader.Begin(iter.Current());

        var el = reader.Next();
        while (el != null)
        {
            if (el.GetType().OneOf(Element.Type.e_image, Element.Type.e_inline_image))
            {
                var id = el.GetXObject().ToPdfIdentifier();

                yield return (id, el);
            }
            el = reader.Next();
        }

        reader.End();
    }
}

This kind of worked in that it returned some images, but not all. For some sample documents it returned all, for some it returned a subset, and for some it returned none at all.


edit

Just for future reference, thanks to the answer below from Ryan, we ended up with a pair of nice clean extension methods:

public static IEnumerable<SDF.Obj> GetAllObj(this SDF.SDFDoc sdfDoc)
{
    var xrefTableSize = sdfDoc.XRefSize();
    for (int objNum = 0; objNum < xrefTableSize; objNum++)
    {
        var obj = sdfDoc.GetObj(objNum);
        if (!obj.IsFree())
        {
            yield return obj;
        }
    }
}

and

public static string Subtype(this SDF.Obj obj) => obj.FindObj("Subtype") switch
{
    null => null,
    var s when s.IsName() => s.GetName(),
    var s when s.IsString() => s.GetAsPDFText(),
    _ => throw new Exception("COS object has an invalid Subtype entry")
};

Now we can get images as simply as sdfDoc.GetAllObj().Where(o => o.IsStream() && o.Subtype() == "Image"); or, in LINQ query syntax:

from o in sdfDoc.GetAllObj()
where o.IsStream() && o.Subtype() == "Image"
select new Image(o);
superstator
  • Your code would skip any arrays, and also skips any indirect objects. "How can I simply get all COS objects from a document?" FYI Inline images are not COS objects by definition, they are completely defined inside the content stream itself. – Ryan Feb 13 '20 at 20:10
  • What is your overall objective? Get all the images in the PDF? All the images that are actually used in a page? What if there is an image in an annotation or attachment? I can provide code, but knowing what you want to accomplish is more important I think. Hopefully you can elaborate on why getting this info helps you. – Ryan Feb 13 '20 at 20:11
  • Yes, the overall goal is to get all images in the document, regardless of location. Our old parser would parse the trailer & xrefs and return all objects in the document, direct or otherwise, then we'd just filter to `/Subtype=Image` and be done. As I said, we tried this with `PDFDoc` but it failed pretty miserably. I'll edit the question to add that code as well. – superstator Feb 13 '20 at 21:22
  • Your first code snippet misses arrays. Your second code snippet skips Form XObjects, which often contain images inside. "Our old parser would parse the trailer & xrefs and return all objects in the document, direct or otherwise, then we'd just filter to /Subtype=Image and be done." That logic would miss any inline images. Regardless, I will provide you soon the code that should give you the same output. In the meantime, it would be great if you could provide an example PDF file, so we are on the same page. – Ryan Feb 13 '20 at 21:34
  • All of our working samples are copyrighted, so I can't post those. I'll look around and see if I can find something public domain that is useful. – superstator Feb 13 '20 at 21:38
  • How about https://arxiv.org/pdf/2002.04610.pdf. Our old parser gets 9 images from that document. – superstator Feb 13 '20 at 22:21

1 Answer


If you want to get the images that are actually used on a page of the PDF (in case there happen to be unused images in the PDF), then you would use this sample code. This code would have the added bonus of including inline images. https://www.pdftron.com/documentation/samples/dotnetcore/cs/ImageExtractTest

That approach can be slow, though, if the document has hundreds or thousands of graphically complex pages.
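As a rough illustration of that page-based approach, here is a minimal sketch modelled on the linked ImageExtractTest sample. The part that matters, compared with the code in the question, is the e_form case, which descends into Form XObjects. The names PageImageWalk and Walk are made up for this example, and the file name is simply the arXiv sample mentioned in the comments.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class PageImageWalk
{
    static int imageCount = 0;

    // Walks one content stream and recurses into Form XObjects,
    // which often contain images of their own.
    static void Walk(ElementReader reader)
    {
        Element el;
        while ((el = reader.Next()) != null)
        {
            switch (el.GetType())
            {
                case Element.Type.e_image:        // image XObject (backed by a COS stream)
                case Element.Type.e_inline_image: // defined inside the content stream itself
                    ++imageCount;
                    Console.WriteLine("Image {0}: {1}x{2}",
                        imageCount, el.GetImageWidth(), el.GetImageHeight());
                    break;

                case Element.Type.e_form:
                    reader.FormBegin(); // enter the form's own content stream
                    Walk(reader);
                    reader.End();
                    break;
            }
        }
    }

    static void Main()
    {
        PDFNet.Initialize();
        using (PDFDoc doc = new PDFDoc("2002.04610.pdf"))
        using (ElementReader reader = new ElementReader())
        {
            doc.InitSecurityHandler();
            for (PageIterator itr = doc.GetPageIterator(); itr.HasNext(); itr.Next())
            {
                reader.Begin(itr.Current());
                Walk(reader);
                reader.End();
            }
        }
        Console.WriteLine("Total image elements: {0}", imageCount);
    }
}
```

This counts image elements as they are painted, so an image XObject drawn on several pages is counted once per use, and images that live only in annotations or attachments are not visited.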

The other way, as you described, is to iterate over the COS objects. The following C# code finds all Image streams. Note that the PDF standard specifically requires streams to be indirect objects, so I think you can safely skip reading through the direct objects.

using (PDFDoc doc = new PDFDoc("2002.04610.pdf"))
{
    doc.InitSecurityHandler();
    int xrefSz = doc.GetSDFDoc().XRefSize();
    for (int xrefCounter = 0; xrefCounter < xrefSz; ++xrefCounter)
    {
        Obj o = doc.GetSDFDoc().GetObj(xrefCounter);
        if (o.IsFree())
        {
            continue;
        }
        if(o.IsStream())
        {
            Obj subtypeObj = o.FindObj("Subtype");
            if (subtypeObj != null)
            {
                string subtype = "";
                if(subtypeObj.IsName()) subtype = subtypeObj.GetName();
                if(subtypeObj.IsString()) subtype = subtypeObj.GetAsPDFText(); // Subtype should be a Name, but just in case
                if (subtype.CompareTo("Image") == 0)
                {
                    Console.WriteLine("Indirect object {0} is an Image Stream", o.GetObjNum());
                }
            }
        }
    }
}
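For comparison, the trailer walk from the question could probably be repaired by also descending into arrays, which is the gap pointed out in the comments. The following is only a sketch: CosWalk and ReachableIndirectObjects are hypothetical names, and it assumes that DictIterator.Value() and Obj.GetAt() hand back the referenced object when an entry is an indirect reference.

```csharp
using System;
using System.Collections.Generic;
using pdftron.SDF;

static class CosWalk
{
    // Yields every indirect object reachable from the trailer, each one once.
    public static IEnumerable<Obj> ReachableIndirectObjects(SDFDoc sdfDoc)
    {
        var seen = new HashSet<(long, long)>(); // (object number, generation)
        var pending = new Stack<Obj>();         // explicit stack avoids deep recursion
        pending.Push(sdfDoc.GetTrailer());

        while (pending.Count > 0)
        {
            Obj obj = pending.Pop();
            if (obj == null) continue;

            if (obj.IsIndirect())
            {
                // Add returns false if this object was already visited.
                if (!seen.Add((obj.GetObjNum(), obj.GetGenNum()))) continue;
                yield return obj;
            }

            if (obj.IsDict() || obj.IsStream())
            {
                for (DictIterator it = obj.GetDictIterator(); it.HasNext(); it.Next())
                    pending.Push(it.Value());
            }
            else if (obj.IsArray()) // the container type the question's code skipped
            {
                for (int i = 0; i < obj.Size(); i++)
                    pending.Push(obj.GetAt(i));
            }
        }
    }
}
```

A walk like this only finds objects that are reachable from the trailer, so orphaned objects that still have an xref entry would be missed. The xref loop above is simpler and does not have that limitation.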
Ryan