I have a test PDF file with an image on every page: 4 pages, 4 images. The PDF was created by converting the corresponding DOCX file to PDF with LibreOffice.
Here is a C# function that extracts all the images from the PDF document:
public static void ExtractImages(string filePath)
{
    PDDocument pdfDocument = null;
    try
    {
        pdfDocument = PDDocument.load(filePath);
        List documentPages = pdfDocument.getDocumentCatalog().getAllPages();
        Iterator pagesIterator = documentPages.iterator();
        int i = 1;
        string name = null;
        int pageNumber = 0;
        while (pagesIterator.hasNext())
        {
            PDPage page = (PDPage)pagesIterator.next();
            PDResources resources = page.getResources();
            Map pageImages = resources.getXObjects();
            if (pageImages != null)
            {
                Iterator imageIterator = pageImages.keySet().iterator();
                while (imageIterator.hasNext())
                {
                    string key = (string)imageIterator.next();
                    PDXObjectImage image = (PDXObjectImage)pageImages.get(key);
                    var fileName = "C:\\" + i;
                    image.write2file(fileName);
                    i++;
                }
            }
        }
    }
    finally
    {
        pdfDocument?.close();
    }
}
The problem is that the resources.getXObjects() method returns 4 images for every page in the document, rather than just the one image on that page.
The problem is reproducible only with PDF files converted by LibreOffice. All other PDFs seem to work fine.
What could be the problem here?