Extract images properly
As the updated PDF makes clear, the problem is that the page does not have any images directly on it; instead, form XObjects are drawn onto it, and those contain the images. Thus, the image search has to recurse into the form XObjects.
And that is not all: all pages in the updated PDF share the same resources dictionary, and each page merely picks a different one of its form XObjects to display. Thus, one really has to parse the respective page content stream to determine which XObject (with which images) is present on a given page.
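To illustrate why the content stream matters, here is a minimal sketch using a toy in-memory model rather than PDFBox. The names XObjectUsageSketch, XObj, sharedResources and imagesOnPage are hypothetical, and the whitespace tokenizer is a gross simplification; a real content stream needs a proper parser. It only reports images reached via Do operators, recursing into forms, even though both pages share the same resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class XObjectUsageSketch {
    /** Toy stand-in for an XObject: an image when content is null, otherwise a form. */
    static final class XObj {
        final String content;
        final Map<String, XObj> resources;

        XObj(String content, Map<String, XObj> resources) {
            this.content = content;
            this.resources = resources;
        }
    }

    /** Resources shared by all pages: two forms, each wrapping one image. */
    static Map<String, XObj> sharedResources() {
        XObj im1 = new XObj(null, Map.of());
        XObj im2 = new XObj(null, Map.of());
        return Map.of(
                "Fm1", new XObj("q /Im1 Do Q", Map.of("Im1", im1)),
                "Fm2", new XObj("q /Im2 Do Q", Map.of("Im2", im2)));
    }

    /** Returns the paths of all images actually painted by the given content stream. */
    static List<String> imagesOnPage(String content, Map<String, XObj> resources) {
        List<String> found = new ArrayList<>();
        collect(content, resources, "", found);
        return found;
    }

    private static void collect(String content, Map<String, XObj> resources,
                                String path, List<String> found) {
        String[] tokens = content.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // Only "/Name Do" paints an XObject; everything else is ignored here
            if (!tokens[i].equals("Do") || !tokens[i - 1].startsWith("/")) {
                continue;
            }
            String name = tokens[i - 1].substring(1);
            XObj xObject = resources.get(name);
            if (xObject == null) {
                continue; // name not present in the resources
            }
            if (xObject.content == null) {
                found.add(path + name); // an image
            } else {
                // a form: recurse with the form's own content and resources
                collect(xObject.content, xObject.resources, path + name + "/", found);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, XObj> shared = sharedResources();
        System.out.println(imagesOnPage("q 1 0 0 1 0 0 cm /Fm1 Do Q", shared));
        System.out.println(imagesOnPage("q 1 0 0 1 0 0 cm /Fm2 Do Q", shared));
    }
}
```

Merely walking the shared resources would report both images for every page; following the Do operators attributes each image to the page that actually draws it.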
Actually, parsing the page content streams is something the PDFBox tool ExtractImages does. Unfortunately, though, it does not report the page on which it found a given image, cf. the ExtractImages.java test method testExtractPageImagesTool10948New.
But we can simply borrow from the technique used by that tool:
// PDFBox 2.x API; ImageIOUtil is part of the pdfbox-tools artifact
try (PDDocument document = PDDocument.load(resource))
{
    int page = 1;
    for (final PDPage pdPage : document.getPages())
    {
        final int currentPage = page;
        PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
        {
            // Counts the images found on the current page
            int index = 0;

            @Override
            public void drawImage(PDImage pdImage) throws IOException
            {
                if (pdImage instanceof PDImageXObject)
                {
                    PDImageXObject image = (PDImageXObject)pdImage;
                    File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
                    // Close the stream after writing so the file handle is not leaked
                    try (FileOutputStream out = new FileOutputStream(file))
                    {
                        ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
                    }
                    index++;
                }
            }

            // The remaining callbacks concern path drawing and are not needed for image extraction

            @Override
            public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

            @Override
            public void clip(int windingRule) throws IOException { }

            @Override
            public void moveTo(float x, float y) throws IOException { }

            @Override
            public void lineTo(float x, float y) throws IOException { }

            @Override
            public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }

            @Override
            public Point2D getCurrentPoint() throws IOException { return null; }

            @Override
            public void closePath() throws IOException { }

            @Override
            public void endPath() throws IOException { }

            @Override
            public void strokePath() throws IOException { }

            @Override
            public void fillPath(int windingRule) throws IOException { }

            @Override
            public void fillAndStrokePath(int windingRule) throws IOException { }

            @Override
            public void shadingFill(COSName shadingName) throws IOException { }
        };
        // Processing the page also follows form XObjects drawn in its content stream
        pdfGraphicsStreamEngine.processPage(pdPage);
        page++;
    }
}
(ExtractImages.java test method testExtractPageImages10948New)
This code outputs images with file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.
PS: Please remember to include com.github.jai-imageio:jai-imageio-core in your classpath; it is required for TIFF output.
Flipped images
Another issue the OP observed is that the images sometimes appear flipped upside-down, e.g. in the case of the newest sample file "t1_edited.pdf". The reason is that those images are indeed stored upside-down as image resources in the PDF.
When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors them vertically and thereby creates the expected appearance.
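The effect of such a matrix can be sketched with plain java.awt.geom, independent of PDFBox. In PDF, an image is painted into the unit square, with its top edge at y = 1, and the current transformation matrix maps that square onto the page. The class and method names below (CtmFlipSketch, flipsVertically) and the matrix values are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class CtmFlipSketch {
    /** Returns true if the matrix maps the top edge of the unit square below its bottom edge. */
    static boolean flipsVertically(AffineTransform ctm) {
        Point2D bottom = ctm.transform(new Point2D.Double(0, 0), null);
        Point2D top = ctm.transform(new Point2D.Double(0, 1), null);
        return top.getY() < bottom.getY();
    }

    public static void main(String[] args) {
        // Arguments are in the same order as a PDF matrix [a b c d e f]
        AffineTransform upright = new AffineTransform(200.0, 0.0, 0.0, 100.0, 50.0, 50.0);
        // Negative d mirrors vertically; f is shifted so the same rectangle is covered
        AffineTransform mirrored = new AffineTransform(200.0, 0.0, 0.0, -100.0, 50.0, 150.0);

        System.out.println(flipsVertically(upright));
        System.out.println(flipsVertically(mirrored));
    }
}
```

Both matrices cover the same rectangle on the page; only the second one turns the stored image upside-down while drawing it, which is what makes an upside-down image resource look right on the page.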
By slightly enhancing the drawImage implementation of the extraction code from the first section, one can include indicators of such flips in the names of the exported images:
public void drawImage(PDImage pdImage) throws IOException
{
    if (pdImage instanceof PDImageXObject)
    {
        // The matrix in effect when the image is drawn
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        String flips = "";
        // A negative scale entry indicates mirroring along that axis
        if (ctm.getScaleX() < 0)
            flips += "h";
        if (ctm.getScaleY() < 0)
            flips += "v";
        if (flips.length() > 0)
            flips = "-" + flips;
        PDImageXObject image = (PDImageXObject)pdImage;
        File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
        // Close the stream after writing so the file handle is not leaked
        try (FileOutputStream out = new FileOutputStream(file))
        {
            ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
        }
        index++;
    }
}
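Now vertically or horizontally flipped images are marked accordingly. Note that this check only looks at the signs of the two scale entries of the matrix, so it does not classify images correctly if the matrix additionally rotates or skews them.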
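If one wants to go a step further and export the images the way they appear on the page, the flip could be undone before writing. The following is only a sketch using plain java.awt.image; the class and method names (MirrorSketch, mirror) are hypothetical, and the result is always converted to TYPE_INT_ARGB, which may not be what every output format expects.

```java
import java.awt.image.BufferedImage;

class MirrorSketch {
    /** Returns a copy of the image mirrored horizontally and/or vertically. */
    static BufferedImage mirror(BufferedImage source, boolean horizontal, boolean vertical) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Read from the mirrored position in the source image
                int sourceX = horizontal ? width - 1 - x : x;
                int sourceY = vertical ? height - 1 - y : y;
                result.setRGB(x, y, source.getRGB(sourceX, sourceY));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 1x2 test image: red pixel on top, blue pixel below
        BufferedImage source = new BufferedImage(1, 2, BufferedImage.TYPE_INT_ARGB);
        source.setRGB(0, 0, 0xFFFF0000);
        source.setRGB(0, 1, 0xFF0000FF);

        BufferedImage flipped = mirror(source, false, true);
        System.out.println(Integer.toHexString(flipped.getRGB(0, 0)));
        System.out.println(Integer.toHexString(flipped.getRGB(0, 1)));
    }
}
```

In the drawImage method one would then presumably pass something like mirror(image.getImage(), ctm.getScaleX() < 0, ctm.getScaleY() < 0) to the image writer instead of the raw image.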