Not able to extract images from PDFA1-a format document

Question

I am using the following code to extract images from a PDF in PDF/A-1a format, but I am not getting any images.

List<PDPage> list = document.getDocumentCatalog().getAllPages();

String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {

    PDResources pdResources = page.findResources();

    Map pageImages = pdResources.getImages();
    if (pageImages != null) {
        InputStream xmlInputStream = null;
        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);

            System.out.println(convertStreamToString(xmlInputStream));
            System.out.println(pdxObjectImage.hashCode());
            System.out.println(pdxObjectImage.getColorSpace().getJavaColorSpace().isCS_sRGB());

            pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
            totalImages++;

            break;
        }
    }
}

I can extract images from normal PDFs with the above code, but not from PDF/A-1a PDFs. It seems the following line

PDResources pdResources = page.findResources();

is not returning images. I have also tried page.getResources(), but I still don't get any images. I have even tried iText, but it does not give me any images either.

If I try to convert a page of the PDF to an image using the following code

BufferedImage bufferedImage = page.convertToImage();
File outputfile = new File(destinationDir+"image1.JPEG");
ImageIO.write(bufferedImage, "JPEG", outputfile);

the resulting images seem to have no metadata associated with them, so I still cannot tell their DPI or whether they are color or grayscale.

I am currently using PDFBox for this. I have already spent two days searching on Google, but I still haven't found any code or documentation for doing this.

How can I do this in Java?

Is it possible to get the DPI, or to find out whether the PDF is color or black and white, without extracting the images?

Have you checked whether the PDF in question contains bitmap image xobjects at all? Maybe the images are vector graphics or inlined bitmaps, neither of which will be captured by your code. — mkl, Jan 06 '15 at 14:20
Would replacing Map pageImages = pdResources.getImages(); with Map pageImages = pdResources.getXObjects(); help? — sameer singh, Jan 06 '15 at 14:34
You might want to look at the PDFBox [ExtractImages.java](https://svn.apache.org/repos/asf/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java) tool. — mkl, Jan 06 '15 at 14:43
@sameersingh Except in some cases JPEG, the images don't have dpi information encoded within. DPI is meta information, i.e. you can't tell it just by seeing an image. Re color, one could look at the colorspace and at the bits per component, but my favourite method is go through the image coordinates and check that no color except 0,0,0 or 255,255,255 is there. Then you know it is b/w. — Tilman Hausherr, Jan 06 '15 at 15:44
@mkl: the ExtractImages tool from the trunk catches inline images, but the one from 1.8.8 doesn't. — Tilman Hausherr, Jan 06 '15 at 15:45
@sameersingh if you can upload the PDF somewhere, we'll tell you whether there are images. — Tilman Hausherr, Jan 06 '15 at 15:46
@TilmanHausherr *from 1.8.8 doesn't* - :) good thing I linked the trunk version... — mkl, Jan 06 '15 at 15:53
Please download the sample pdf from this link http://www.myslams.com/test/pdfa.PDF — sameer singh, Jan 07 '15 at 06:04
With the above code I am not getting any error. The file that is generated is a 0-byte PNG. Are you able to extract the image? — sameer singh, Jan 07 '15 at 10:19

Tilman Hausherr · Accepted Answer · 2015-01-07T21:21:55.593

Your problem is a combination of two issues:

1) the "break;". Your file has two images. The first one is transparent or grey or whatever and JPEG encoded, but it isn't the one you want. The second one is the one you want but the break aborts after the first image. So I just changed a code segment of yours to this:

while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
    System.out.println(totalImages);
    pdxObjectImage.write2file("C:\\SOMEPATH\\" + fileName + "_" + totalImages);
    totalImages++;

    //break;
}

2) Your second image (the interesting one) is JBIG2 encoded. To decode it, you need to add the levigo plugin to your class path, as mentioned here. If you don't, you'll get this message in 1.8.8, unless you have disabled logging:

ERROR [main] org.apache.pdfbox.filter.JBIG2Filter:69 - Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.

(You didn't get that error message because it is the second one that is JBIG2 encoded)

Three bonus hints:

3) If you created this image yourself, e.g. on a photocopy machine, find out how to produce PDF images without JBIG2 compression, since that compression is somewhat risky.

4) Don't use pdResources.getImages(); that call is deprecated. Instead, use getXObjects() and check the type of each value while iterating.

Map pageImages = pdResources.getXObjects();
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    Object o = pageImages.get(key);
    // The map can also hold non-image XObjects (e.g. forms), so check the type first
    if (o instanceof PDXObjectImage)
    {
        PDXObjectImage pdxObjectImage = (PDXObjectImage) o;

        // do stuff
    }
}

5) use a foreach loop.
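Hints 4 and 5 can be combined. The following is only a sketch that uses plain JDK types so it runs without PDFBox: XObjectFilter and valuesOfType are hypothetical names, and the comment about passing pdResources.getXObjects() with PDXObjectImage.class assumes the PDFBox 1.8 API used above and is untested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class XObjectFilter {

    // Collects the map values that are instances of the requested type,
    // in the iteration order of the map.
    static <T> List<T> valuesOfType(Map<String, ?> xobjects, Class<T> type) {
        List<T> matches = new ArrayList<T>();
        // for-each over the values replaces the Iterator/key lookup
        for (Object value : xobjects.values()) {
            if (type.isInstance(value)) {
                matches.add(type.cast(value));
            }
        }
        return matches;
    }

    // Assumed usage with PDFBox 1.8:
    //   for (PDXObjectImage image :
    //           XObjectFilter.valuesOfType(pdResources.getXObjects(), PDXObjectImage.class)) {
    //       image.write2file(destinationDir + fileName + "_" + totalImages++);
    //   }
}
```

Whether a helper like this is worth it over a plain for-each with an instanceof check inside the loop is a matter of taste; the point is only that no explicit Iterator or cast of the key is needed.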

And if it wasn't already obvious: this has nothing to do with PDF/A :-)

6) I forgot that you also asked how to tell whether it is a b/w image. Here is some simple (not optimized) code for the approach I mentioned in the comments:

BufferedImage bim = pdxObjectImage.getRGBImage();

boolean bwImage = true;

int w = bim.getWidth();
int h = bim.getHeight();
for (int y = 0; y < h; y++)
{
    for (int x = 0; x < w; x++)
    {
        Color c = new Color(bim.getRGB(x, y));
        int red = c.getRed();
        int green = c.getGreen();
        int blue = c.getBlue();
        if (red == 0 && green == 0 && blue == 0)
        {
            continue;
        }
        if (red == 255 && green == 255 && blue == 255)
        {
            continue;
        }
        bwImage = false;
        break;
    }
    if (!bwImage)
        break;
}
System.out.println(bwImage);
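Since the question also mentions grayscale, the same pixel scan can be extended to tell the three cases apart. This is a sketch under assumptions: ImageColorKind and classify are hypothetical names, and the BufferedImage is expected to come from something like pdxObjectImage.getRGBImage() as above. Lossy (JPEG) images may contain near-black or near-white pixels, so a scanned b/w page can end up classified as grayscale.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox.
final class ImageColorKind {

    enum Kind { BLACK_AND_WHITE, GRAYSCALE, COLOR }

    static Kind classify(BufferedImage image) {
        boolean onlyBlackOrWhite = true;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                // Any pixel with differing channels means the image has color
                if (red != green || green != blue) {
                    return Kind.COLOR;
                }
                // A gray pixel that is neither pure black nor pure white
                if (red != 0 && red != 255) {
                    onlyBlackOrWhite = false;
                }
            }
        }
        return onlyBlackOrWhite ? Kind.BLACK_AND_WHITE : Kind.GRAYSCALE;
    }
}
```

Calling ImageColorKind.classify(bim) on the image from the code above would replace the bwImage flag. If near-black and near-white pixels should still count as b/w, the exact comparisons against 0 and 255 would need a tolerance.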

Brother, your code works perfectly. Now I just have to deal with the pixel density. It seems that Matrix ctmNew = getGraphicsState().getCurrentTransformationMatrix(); throws a NullPointerException in the code you posted at http://stackoverflow.com/questions/5472711/dpi-of-image-extracted-from-pdf-with-pdfbox, and a few classes such as org.apache.pdfbox.pdmodel.graphics.PDXObject are not present in the PDFBox library. — sameer singh, Jan 08 '15 at 12:36
If it works, press the green checkmark... Re: the other issue, are you using 1.8.8 or the unreleased 2.0 version? The API is different in 2.0. But PDXObject exists in both. What has changed is that PDXObjectImage is called PDImageXObject in 2.0. — Tilman Hausherr, Jan 08 '15 at 12:57
So if I use the unreleased 2.0 version, will that DPI code work? — sameer singh, Jan 08 '15 at 13:03
I assume I wrote it for the 1.8 version; it's been 2 years already. But be sure that you get the PrintImageLocations.java example from the 1.8.8 source download if you're currently working with 1.8.8. Finally, please ask all the questions about the DPI thing in that other issue. If you still can't get it to work, mention what version you have decided to use, and I'll research (tonight) what's going on and come back to you. — Tilman Hausherr, Jan 08 '15 at 13:08
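
On the DPI point from the comments: the image itself usually carries no DPI, but an effective DPI can be derived from the pixel size and the size at which the image is drawn on the page. The sketch below assumes the drawn size is already known in PDF user space units (for example from the transformation matrix that the PrintImageLocations example works with), that the default unit of 1/72 inch applies, and that the image is not rotated or skewed. EffectiveDpi and dpi are hypothetical names.

```java
// Hypothetical helper, not part of PDFBox.
final class EffectiveDpi {

    // Default PDF user space: 72 units per inch
    static final double POINTS_PER_INCH = 72.0;

    // pixels: image width (or height) in pixels
    // displayedSizeInPoints: width (or height) at which it is drawn on the page
    static double dpi(int pixels, double displayedSizeInPoints) {
        if (pixels <= 0 || displayedSizeInPoints <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return pixels * POINTS_PER_INCH / displayedSizeInPoints;
    }
}
```

For example, a scan that is 2550 pixels wide and drawn across a 612 point (8.5 inch) page gives EffectiveDpi.dpi(2550, 612) = 300. The horizontal and vertical values can differ, so it may be worth computing both.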
