How can I check if a PDF page is an image (scanned) with PDFBox or Xpdf?

Question

PDFBox problem extracting images. How can I check whether a PDF page is an image and extract it with the PDFBox library? There is a method to get images, but if the whole PDF page is an image, it is not returned. Could someone help me solve this problem?

Xpdf problem extracting images. I tried to extract images with another library, Xpdf, but it strangely flips the page if the page is an image. If the PDF contains a small image as an image object, the result is fine; if the page is scanned, it gets flipped.

I want to extract all images from a PDF: if a page is scanned, get it as an image; if a page contains plain text and images, get the images from that page as well.

My point is to extract all images from the PDF, not only the images placed on a page: if the page itself is an image, it should be extracted as an image too and not skipped, which is what I think PDFBox is doing.

Xpdf does extract something, but the page is flipped (top, right) when it exports a scanned page.

How can I solve this problem? Thanks.

Example file to download for testing

    PDDocument document = PDDocument.load(new File("/home/dru/IdeaProjects2/PDFExtractor/test/t1.pdf"));
    PDPageTree list = document.getPages();

    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        System.out.println(pdResources.getResourceCache());

        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);

            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("/home/dru/IdeaProjects2/PDFExtractor/test/out/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject)o).getImage(), "png", file);
            }
        }
    }

Your question is unclear. PDFs can have images even if they aren't scanned. A flip can be because the user inserted the paper in the wrong direction into the feeder. — Tilman Hausherr, Nov 10 '16 at 19:09
The PDF page itself is an image. For example, when you scan a sheet of paper, the scanner gives you the option to save the image as a PDF file. How can I detect whether a PDF is scanned using PDFBox? When I try to get images from a page with PDFBox, it looks for objects of type image, but it does not detect that the whole PDF page is an image. — dmitri, Nov 11 '16 at 08:30
So what you're really asking is whether an image has the size of the page? But even that is not a certain indicator whether an image was scanned. — Tilman Hausherr, Nov 11 '16 at 16:22
Please share a sample PDF to illustrate the issue and your pivotal code which fails to extract the image. — mkl, Nov 11 '16 at 16:56
The only special thing about the two images returned for your sample PDF is that one image is merely a mask used for the other image, and the other image is the actual image used on the PDF page. If you only want the images immediately used in the page content, you also have to scan the page content. — mkl, Nov 15 '16 at 11:23
By the way, your sample PDF page contains text: "Powered by TCPDF (www.tcpdf.org)". Just select all and copy in Adobe Reader and paste into some editor. Thus, it is an example of a page which contains plain text and images. — mkl, Nov 15 '16 at 11:27
I have added another example PDF; the first one, I think, was not OK. — dmitri, Nov 15 '16 at 16:19
The new PDF does not have any images immediately on the page, but it has form xobjects drawn onto it which do contain images. Thus, your image search has to recurse into the form xobjects. And that is not all: all pages share the same resources dictionary; each page merely picks a different one of its form xobjects to display. Thus, you really have to parse the respective page content stream to determine which xobject (with which images) is present on a given page. — mkl, Nov 15 '16 at 21:33

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12


Extract images properly

As the updated PDF makes clear, the problem is that it does not have any images immediately on the page; instead, form xobjects are drawn onto it, and those contain the images. Thus, the image search has to recurse into the form xobjects.

And that is not all: all pages in the updated PDF share the same resources dictionary; each page merely picks a different one of its form xobjects to display. Thus, one really has to parse the respective page content stream to determine which xobject (with which images) is present on a given page.

Actually this is something the PDFBox tool ExtractImages does. Unfortunately, though, it does not show the page it found the image in question on, cf. the ExtractImages.java test method testExtractPageImagesTool10948New.

But we can simply borrow from the technique used by that tool:

PDDocument document = PDDocument.load(resource);
int page = 1;
for (final PDPage pdPage : document.getPages())
{
    final int currentPage = page;
    PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
    {
        int index = 0;
        
        @Override
        public void drawImage(PDImage pdImage) throws IOException
        {
            if (pdImage instanceof PDImageXObject)
            {
                PDImageXObject image = (PDImageXObject)pdImage;
                File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
                try (FileOutputStream out = new FileOutputStream(file))
                {
                    // close the stream even if writing the image fails
                    ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
                }
                index++;
            }
        }

        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

        @Override
        public void clip(int windingRule) throws IOException { }

        @Override
        public void moveTo(float x, float y) throws IOException {  }

        @Override
        public void lineTo(float x, float y) throws IOException { }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {  }

        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }

        @Override
        public void closePath() throws IOException { }

        @Override
        public void endPath() throws IOException { }

        @Override
        public void strokePath() throws IOException { }

        @Override
        public void fillPath(int windingRule) throws IOException { }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }

        @Override
        public void shadingFill(COSName shadingName) throws IOException { }
    };
    pdfGraphicsStreamEngine.processPage(pdPage);
    page++;
}

(ExtractImages.java test method testExtractPageImages10948New)

This code outputs images with file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.

PS: Please remember to include com.github.jai-imageio:jai-imageio-core in your classpath, it is required for TIFF output.

Flipped images

Another issue of the OP was that the images sometimes appear flipped upside-down, e.g. in the case of his newest sample file "t1_edited.pdf". The reason is that those images are indeed stored upside-down as image resources in the PDF.

When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors the image drawn vertically and so creates the expected appearance.
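To make the mirroring concrete, here is a minimal sketch using only `java.awt.geom.AffineTransform` as a stand-in for the PDF matrix; it does not use PDFBox. The class `CtmMirrorDemo`, its method names, and the page size of 595 x 842 points are assumptions chosen for illustration. PDF paints an image into the unit square, so a matrix with a negative vertical scale maps the bottom edge of that square to the top of the target area:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Illustrates how a matrix with a negative vertical scale mirrors an image (hypothetical demo class). */
final class CtmMirrorDemo {

    /**
     * Builds a matrix of the form "a b c d e f" with a = width and d = -height.
     * The translation is shifted up by height so the mirrored unit square
     * still covers the area from (x, y) to (x + width, y + height).
     */
    static AffineTransform mirroringCtm(double x, double y, double width, double height) {
        // AffineTransform(m00, m10, m01, m11, m02, m12) corresponds to PDF (a, b, c, d, e, f)
        return new AffineTransform(width, 0, 0, -height, x, y + height);
    }

    /** True if the vertical scale component of the matrix is negative. */
    static boolean isVerticallyMirrored(AffineTransform ctm) {
        return ctm.getScaleY() < 0;
    }

    public static void main(String[] args) {
        AffineTransform ctm = mirroringCtm(0, 0, 595, 842);
        Point2D bottomLeft = ctm.transform(new Point2D.Double(0, 0), null);
        Point2D topLeft = ctm.transform(new Point2D.Double(0, 1), null);
        System.out.println("unit square (0,0) lands at " + bottomLeft);
        System.out.println("unit square (0,1) lands at " + topLeft);
        System.out.println("mirrored vertically: " + isVerticallyMirrored(ctm));
    }
}
```

An image that is stored upside-down and drawn through such a matrix ends up upright on the page, which is why the page looks right in a viewer while the raw extracted image does not.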

By slightly enhancing the drawImage implementation in the code above, one can include indicators of such flips in the names of the exported images:

public void drawImage(PDImage pdImage) throws IOException
{
    if (pdImage instanceof PDImageXObject)
    {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        String flips = "";
        if (ctm.getScaleX() < 0)
            flips += "h";
        if (ctm.getScaleY() < 0)
            flips += "v";
        if (flips.length() > 0)
            flips = "-" + flips;
        PDImageXObject image = (PDImageXObject)pdImage;
        File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
        try (FileOutputStream out = new FileOutputStream(file))
        {
            // close the stream even if writing the image fails
            ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
        }
        index++;
    }
}

Now vertically or horizontally flipped images are marked accordingly.
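If the goal is to export the image the way it appears on the page rather than just to mark it, the flip can be undone before writing. The following is a minimal sketch using only the JDK; the class `ImageUnflipper` and its method `unflip` are hypothetical helpers and not part of PDFBox. The two flags are meant to be the same sign checks on the transformation matrix as in `drawImage`:

```java
import java.awt.image.BufferedImage;

/** Undoes horizontal and/or vertical mirroring of an extracted image (hypothetical helper). */
final class ImageUnflipper {

    /**
     * Returns a mirrored copy of the source image.
     *
     * @param horizontal mirror left to right, e.g. when the horizontal scale is negative
     * @param vertical   mirror top to bottom, e.g. when the vertical scale is negative
     */
    static BufferedImage unflip(BufferedImage source, boolean horizontal, boolean vertical) {
        int width = source.getWidth();
        int height = source.getHeight();
        // Keep an alpha channel only if the source has one; plain RGB is safer for JPEG output
        int type = source.getColorModel().hasAlpha()
                ? BufferedImage.TYPE_INT_ARGB : BufferedImage.TYPE_INT_RGB;
        BufferedImage result = new BufferedImage(width, height, type);
        for (int y = 0; y < height; y++) {
            int targetY = vertical ? height - 1 - y : y;
            for (int x = 0; x < width; x++) {
                int targetX = horizontal ? width - 1 - x : x;
                result.setRGB(targetX, targetY, source.getRGB(x, y));
            }
        }
        return result;
    }
}
```

Inside `drawImage` this could be used as `ImageUnflipper.unflip(image.getImage(), ctm.getScaleX() < 0, ctm.getScaleY() < 0)` before writing the file. Two caveats: the pixel-by-pixel copy is slow and memory-hungry for large scans, and a pure sign check on the scale components does not catch rotations, where the scale components can be zero and the shear components carry the transformation.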

answered Nov 15 '16 at 22:50

mkl

I ran a test on a file of about 1 MB and the test failed: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. Could you please download the example and run the same test to see the error trace? I want to know how to fix that. Thank you – dmitri Nov 16 '16 at 08:42
I just ran your sample file through the extractor described above (cf. [ExtractImages test `testExtractPageImagesT1Edited`](https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/ExtractImages.java#L182)) but didn't get an `OutOfMemoryError`, it merely took quite some time... You should try and allot more memory to that process. – mkl Nov 16 '16 at 10:08
I used PDFBox 2.0.3; do you have the same version? – dmitri Nov 16 '16 at 10:58
I actually use the current PDFBox development branch, 2.1.0-SNAPSHOT. But I just repeated the test with PDFBox 2.0.3, no problems there. Furthermore I'm working on Java 8; memory management has considerably changed between Java 7 and Java 8. If you happen to use Java 7 or earlier, you might run into memory exhaustion earlier. – mkl Nov 16 '16 at 11:21
I am using Java 8 and PDFBox 2.0.3. I will see how to fix that; if you have an idea how to increase the memory limit, that would be great. Maybe it is an IDE limit at run time that I need to raise to avoid the automatic exit; my environment is IntelliJ IDEA – dmitri Nov 16 '16 at 12:42
Yes, no problem, it extracts the images even with the latest version. But why does it flip them? If you run the test, you will see that the resulting pages are flipped, and this happens only for some of the images, not all – dmitri Nov 16 '16 at 12:59
They are flipped because those specific images are stored upside-down in the PDF resources. But when those image resources are used, a transformation matrix is in place which mirrors vertically and so results in the desired appearance. – mkl Nov 16 '16 at 15:17
Can I identify whether an image is flipped by reading the PDImageXObject, so that I can fix that problem? – dmitri Nov 16 '16 at 16:33
Yes, see my edit of the answer itself, the new section "Flipped images". – mkl Nov 16 '16 at 17:26
Thanks a lot again. I'm still looking into the mirror problem: the vertical flip is OK, but the mirroring is not detected. I think the pixels are inverted – dmitri Nov 17 '16 at 09:21
Which image in which of your sample documents was a sample of that *Mirror problem*? – mkl Nov 17 '16 at 09:38
Yes, the attached PDF file has the mirror problem: if you look at the result, the page image is flipped vertically and mirrored horizontally – dmitri Nov 17 '16 at 09:53
I don't see any mirroring in addition to the vertical flip. The vertical flip already is a mirroring, it is not a rotation. – mkl Nov 17 '16 at 10:39
Hi, I have another PDF file for which extracting the page image is not working. Can you download the latest file to test? – dmitri Dec 13 '16 at 14:36
@dmitri *"I have another PDF file on which extract page Image is not working"* - In which way does it not work? I get all the stripes embedded in the PDF. – mkl Dec 13 '16 at 16:33
Using PDFGraphicsStreamEngine I do not get them, only some warnings related to lineTo, moveTo and closePath. I will post the warning message here later – dmitri Dec 13 '16 at 18:58
Once I call pdfGraphicsStreamEngine.processPage(pdPage), I get the warnings and drawImage is never called. Try downloading my latest file – dmitri Dec 13 '16 at 19:03
@dmitri For your latest file, 1604-Orange_flat_2_edited.pdf , the code above extracts 7 jpg files per page here. I have not changed the code, merely the input and output file names. I tried with both 2.0.3 and 2.1.0-SNAPSHOT. – mkl Dec 14 '16 at 07:53
Yes, you are right, some images are sliced. I have attached another example PDF (test_faxt.pdf); try downloading and testing it. I can extract only a logo – dmitri Dec 14 '16 at 15:27
@dmitri You only extract the logo from test_fact.pdf because the logo really is the only bitmap image in that file. All the rest, lines, colored areas *and text*, is drawn using vector graphics operations: paths consisting of lines, curves and rectangles, stroked or filled. This is one way to create a PDF and prevent easy text extraction and change - you cannot simply take Adobe Acrobat and use the text touchup tool. – mkl Dec 14 '16 at 22:09
I tried to extract text with PDFTextStripper and there was no text; from that I assumed the page is only an image – dmitri Dec 15 '16 at 07:54
@dmitri PDF is not that trivial... ;) – mkl Dec 15 '16 at 10:48
