PDFBox image extraction: the library returns all the images in the document for every page

Question

I have a test PDF file with an image on every page: 4 pages, 4 images. The PDF file was created by converting the corresponding DOCX file to PDF using LibreOffice.

Here is a C# function that extracts all the images from the PDF document:

public static void ExtractImages(string filePath)
{
    PDDocument pdfDocument = null;
    try
    {
        pdfDocument = PDDocument.load(filePath);

        List documentPages = pdfDocument.getDocumentCatalog().getAllPages();
        Iterator pagesIterator = documentPages.iterator();
        int i = 1; // running index used to build the output file names

        while (pagesIterator.hasNext())
        {
            PDPage page = (PDPage)pagesIterator.next();
            PDResources resources = page.getResources();
            Map pageImages = resources.getXObjects();

            if (pageImages != null)
            {
                Iterator imageIterator = pageImages.keySet().iterator();
                while (imageIterator.hasNext())
                {
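                    string key = (string)imageIterator.next();

                    // XObjects can also be forms, so only handle entries that are images.
                    var image = pageImages.get(key) as PDXObjectImage;
                    if (image == null)
                    {
                        continue;
                    }

                    // write2file appends a suitable file extension to the given name.
                    var fileName = "C:\\" + i;
                    image.write2file(fileName);
                    i++;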

                }
            }
        }
    }
    finally
    {
        pdfDocument?.close();
    }
}

The problem is that the method

resources.getXObjects()

returns all 4 images for every page in the document, instead of only the one image shown on that page.

The problem is reproducible only for PDF files converted by LibreOffice. All other files seem to work fine.

What could be the problem here?

Maybe the images were set as global resources, i.e., not just as page-local resources => please share your PDF. This can happen. The ExtractImages tool can detect this (and it can also extract images in nested items) => see its source code. — Tilman Hausherr, Jul 10 '19 at 13:14
In addition to the resources probably being global, please be aware that the resources of a page can host completely unused data! The resources of a page merely are a pool of objects available when drawing a page, there is no requirement to actually use all of them. — mkl, Jul 10 '19 at 13:31
As has already been conjectured above, all pages share the same single **Resources** dictionary which contains all images. To determine which image is on which page, use a custom `PDFGraphicsStreamEngine` as shown in the PDFBox tool `ExtractImages`. — mkl, Jul 12 '19 at 13:33
PDFGraphicsStreamEngine comes from version 2 of PdfBox. But the latest version available for .Net is 1.8.9, unfortunately. — Alex, Jul 12 '19 at 19:33
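
Since `PDFGraphicsStreamEngine` is not available in 1.8.x, one possible workaround is to look at what each page's content stream actually draws. The following is only a minimal sketch, assuming the .NET port exposes the PDFBox 1.8 classes `PDFStreamParser` and `PDFOperator` under their Java package names. The class name `PageImageUsage` and the method name `GetDrawnXObjectNames` are made up for illustration. The sketch collects the resource names that are passed to the `Do` operator on a page. It does not descend into form XObjects and it ignores inline images, so it is not a full replacement for the `ExtractImages` approach.

```csharp
using System.Collections.Generic;
using org.apache.pdfbox.cos;
using org.apache.pdfbox.pdfparser;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper, not part of PDFBox.
public static class PageImageUsage
{
    // Returns the XObject names that the page's own content stream
    // draws with the "Do" operator (operand: /Name, operator: Do).
    public static HashSet<string> GetDrawnXObjectNames(PDPage page)
    {
        var names = new HashSet<string>();

        // A page without content draws nothing.
        var contents = page.getContents();
        if (contents == null)
        {
            return names;
        }

        // Tokenize the content stream into operands and operators.
        var parser = new PDFStreamParser(contents);
        parser.parse();
        java.util.List tokens = parser.getTokens();

        for (int t = 1; t < tokens.size(); t++)
        {
            var op = tokens.get(t) as PDFOperator;
            if (op != null && op.getOperation() == "Do")
            {
                // The operand directly before "Do" is the resource name.
                var name = tokens.get(t - 1) as COSName;
                if (name != null)
                {
                    names.Add(name.getName());
                }
            }
        }

        return names;
    }
}
```

Inside the image loop of `ExtractImages`, the set returned for the current page could then be used to skip every `key` it does not contain, so that only images referenced by that page are written out even when all pages share one Resources dictionary.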
