When I extract an image with PDFBox, I get an incorrect dpi for the images in some PDFs. When I extract the image using Photoshop or Acrobat Reader Pro, Windows Photo Viewer shows its dpi as 200, but when I extract it using PDFBox the dpi is 72.

To extract the image I am using the code from this question: Not able to extract images from PDFA1-a format document

When I check the logs I see an unusual entry: 2015-01-23-main--DEBUG-org.apache.pdfbox.util.TIFFUtil:


<?xml version="1.0" encoding="UTF-8"?>
<javax_imageio_jpeg_image_1.0>
  <JPEGvariety>
    <app0JFIF majorVersion="1" minorVersion="2" resUnits="0" Xdensity="1" Ydensity="1" thumbWidth="0" thumbHeight="0"/>
  </JPEGvariety>
  <markerSequence>
    <dqt>
      <dqtable elementPrecision="0" qtableId="0"/>
      <dqtable elementPrecision="0" qtableId="1"/>
    </dqt>
    <dht>
      <dhtable class="0" htableId="0"/>
      <dhtable class="0" htableId="1"/>
      <dhtable class="1" htableId="0"/>
      <dhtable class="1" htableId="1"/>
    </dht>
    <sof process="0" samplePrecision="8" numLines="0" samplesPerLine="0" numFrameComponents="3">
      <componentSpec componentId="1" HsamplingFactor="2" VsamplingFactor="2" QtableSelector="0"/>
      <componentSpec componentId="2" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
      <componentSpec componentId="3" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
    </sof>
    <sos numScanComponents="3" startSpectralSelection="0" endSpectralSelection="63" approxHigh="0" approxLow="0">
      <scanComponentSpec componentSelector="1" dcHuffTable="0" acHuffTable="0"/>
      <scanComponentSpec componentSelector="2" dcHuffTable="1" acHuffTable="1"/>
      <scanComponentSpec componentSelector="3" dcHuffTable="1" acHuffTable="1"/>
    </sos>
  </markerSequence>
</javax_imageio_jpeg_image_1.0>

I tried to google it, but I can't seem to find out what PDFBox means by this log entry. What does it mean?

You can download a sample pdf with this problem from this link: http://myslams.com/test/1.pdf

I have even tried iText, but it extracts the image with 96 dpi.

Am I doing something wrong, or do PDFBox and iText have this limitation?

Ghoul Fool
sameer singh
  • Generally dpi does not make sense in the context of bitmaps in PDFs. – mkl Jan 25 '15 at 21:54
  • Are you saying that the sample PDF I mentioned above contains bitmap images? – sameer singh Jan 26 '15 at 05:46
  • It doesn't make sense to expect an image to have a DPI value *after* extracting it from a PDF. If you want to know the DPI of the image *while it is still inside the PDF*, you need to read the answer to this question: [Getting Image DPI in PDF files using iText](http://stackoverflow.com/questions/25550000/getting-image-dpi-in-pdf-files-using-itext) You should not claim that iText and PdfBox give you the wrong DPI. It's your understanding of DPI that is wrong. – Bruno Lowagie Jan 26 '15 at 07:29
  • Also, when you post a sample PDF, make sure that it doesn't say: *The requested URL /test/1.pdf was not found on this server.* – Bruno Lowagie Jan 26 '15 at 07:32
  • @bruno I had to remove the PDFs because of confidentiality, sorry for that. I will be putting up sample PDFs tomorrow. – sameer singh Jan 26 '15 at 08:29
  • *are you saying that the sample pdf I have mentioned above contains bitmap images* - No. I wanted to download the PDF now but it is gone AWOL. – mkl Jan 26 '15 at 09:01
  • @sameersingh It doesn't really matter if this PDF is present or not. You are assuming that an image extracted from a PDF has a DPI. That assumption is wrong. An image extracted from a PDF has a number of pixels. The DPI only makes sense when those pixels are rendered on a page using a specific dimension e.g. in points. – Bruno Lowagie Jan 26 '15 at 10:04
  • The log entry is the metadata of an image file that is written when ImageIOUtils is used to save an image. It only appears in debug mode. – Tilman Hausherr Jan 27 '15 at 21:48

1 Answer

After some digging I found your 1.pdf. Thus,...

PDFBox

In comments to this recent answer @Tilman and you were discussing this older answer in which @Tilman pointed towards the PrintImageLocations PDFBox example. I ran it for your file and got:

Processing page: 0
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 888px
size = 613.44, 319.68
size = 8.52in, 4.44in
size = 216.408mm, 112.776mm

Processing page: 1
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 2
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 3
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 1464px
size = 613.44, 527.04
size = 8.52in, 7.3199997in
size = 216.408mm, 185.928mm

On all pages this amounts to 200 dpi both in x and y directions (1704px / 8.52in = 888px / 4.44in = 2800px / 14.0in = 1464px / 7.32in = 200 dpi).
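
The arithmetic above can be reproduced with a few lines of plain Java. This is only a sketch: the class name `ImageDpi` and the method `dpi` are made up for illustration, the numbers are copied from the PrintImageLocations output above, and it assumes the default user space unit of 1/72 inch (no `UserUnit` entry on the page) and an unrotated, unskewed image.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox or iText.
public class ImageDpi {
    // Assumption: default PDF user space, 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    // Resolution along one axis: pixels divided by the displayed size in inches.
    static double dpi(int pixels, double userUnits) {
        return pixels / (userUnits / UNITS_PER_INCH);
    }

    public static void main(String[] args) {
        // Pixel sizes and displayed sizes (in user units) as printed above.
        int[][] px = { {1704, 888}, {1704, 2800}, {1704, 2800}, {1704, 1464} };
        double[][] uu = { {613.44, 319.68}, {613.44, 1008.0}, {613.44, 1008.0}, {613.44, 527.04} };

        for (int i = 0; i < px.length; i++) {
            System.out.printf(Locale.ROOT, "Page %d: %.1f x %.1f dpi%n",
                    i, dpi(px[i][0], uu[i][0]), dpi(px[i][1], uu[i][1]));
        }
    }
}
```

For the four pages listed, both axes should come out at about 200 dpi.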

So PDFBox gives you the dpi values you are after.

(@Tilman: The current 2.0.0-SNAPSHOT version of that sample returns utter nonsense; you might want to fix this.)

iText

A simplified iText version of that PDFBox example would be this:

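// Imports assuming iText 5.x; the method and the listener below belong inside a class of your choice.
import java.io.IOException;
import java.io.InputStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public void printImageLocations(InputStream stream) throws IOException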
{
    PdfReader reader = new PdfReader(stream);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageRenderListener listener = new ImageRenderListener();

    for (int page = 1; page <= reader.getNumberOfPages(); page++)
    {
        System.out.printf("\nPage %s:\n", page);
        parser.processContent(page, listener);
    }
}

static class ImageRenderListener implements RenderListener
{
    public void beginTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }
    public void endTextBlock() { }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfDictionary imageDict = renderInfo.getImage().getDictionary();

            // Size of the image resource in pixels
            float widthPx = imageDict.getAsNumber(PdfName.WIDTH).floatValue();
            float heightPx = imageDict.getAsNumber(PdfName.HEIGHT).floatValue();
            // Size of the image on the page in user units (1/72 in by default)
            float widthUu = renderInfo.getImageCTM().get(Matrix.I11);
            float heightUu = renderInfo.getImageCTM().get(Matrix.I22);

            System.out.printf("Image %.0fpx*%.0fpx, %.0fuu*%.0fuu, %.2fin*%.2fin\n", widthPx, heightPx, widthUu, heightUu, widthUu/72, heightUu/72);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

(Beware: I assumed unrotated and unskewed images.)

The results for your file:

Page 1:
Image 1704px*888px, 613uu*320uu, 8,52in*4,44in

Page 2:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 3:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 4:
Image 1704px*1464px, 613uu*527uu, 8,52in*7,32in

Thus, also 200dpi all along. So iText, too, gives you the dpi values you are after.

Your code

Obviously the code you referenced had no chance to report a dpi value sensible in the context of the PDF because it only extracts the images as found in the resources but ignores how the respective image resource is used on the page.

An image resource can be stretched, rotated, skewed, ... any way the author likes when he uses it in the page content.

BTW, a dpi value only makes sense if the author did not skew the image and rotated it only by a multiple of 90°.
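
To illustrate the point about rotation: the code above reads the displayed width and height directly from two matrix entries, which only works for upright images. A slightly more general sketch takes the lengths of the image's two axes as mapped by the matrix [a b c d e f]. The class `CtmDpi` and its method are hypothetical, the default user unit of 1/72 inch is again assumed, and with iText the values a, b, c, d would presumably come from `Matrix.I11`, `I12`, `I21` and `I22`. For skewed images the result is still not a meaningful dpi.

```java
// Hypothetical helper, not part of PDFBox or iText.
public class CtmDpi {
    // Assumption: default PDF user space, 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    /**
     * Returns {dpiX, dpiY} for an image of widthPx * heightPx pixels drawn
     * with a transformation matrix whose linear part is [a b; c d].
     */
    static double[] dpi(int widthPx, int heightPx, double a, double b, double c, double d) {
        // The unit square's x axis maps to (a, b), its y axis to (c, d);
        // their lengths are the displayed width and height in user units.
        double widthUu = Math.hypot(a, b);
        double heightUu = Math.hypot(c, d);
        return new double[] {
            widthPx * UNITS_PER_INCH / widthUu,
            heightPx * UNITS_PER_INCH / heightUu
        };
    }

    public static void main(String[] args) {
        // First page of the sample, as used upright ...
        double[] upright = dpi(1704, 888, 613.44, 0, 0, 319.68);
        // ... and the same image as it would look if rotated by 90 degrees.
        double[] rotated = dpi(1704, 888, 0, 613.44, -319.68, 0);
        System.out.println(upright[0] + " x " + upright[1]);
        System.out.println(rotated[0] + " x " + rotated[1]);
    }
}
```

Both calls should yield roughly 200 dpi on each axis, since a pure rotation does not change the displayed size.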

Community
mkl
  • I use Maven to supply the required dependencies automatically. The Maven coordinates are `org.apache.pdfbox:pdfbox-examples:1.8.8` – mkl Jan 27 '15 at 12:18
  • Thanks, I am checking the answer. There are a few specimen PDFs that I need to check; I will get back to you. – sameer singh Jan 27 '15 at 12:26
  • @mkl: thanks for bringing this to my attention. Fixed in PDFBOX-2635. (hopefully) – Tilman Hausherr Jan 27 '15 at 23:41
  • @mkl Thanks for the answer. Earlier, when I tried to run the PrintImageLocations.java example, I was getting a lot of import problems, but the PrintImageLocations PDFBox example you linked to worked with the pdfbox 1.8.8 jar file and there were no import issues. It has also solved the other issue I was having: http://stackoverflow.com/questions/28141376/pdfbox-and-itext-not-able-to-extract-image – sameer singh Jan 29 '15 at 14:06
  • @sameersingh *Earlier when i tried to run the PrintImageLocations.java example I was getting lot of import problems* - most likely a version mismatch; maybe you got the examples for the current PDFBox development version, 2.0.0-SNAPSHOT. I recommend using Maven to collect the required dependencies. – mkl Jan 29 '15 at 15:07
  • @sameersingh Another trick that makes life somewhat easier is to use pdfbox-app and preflight-app. But besides that, Maven is really useful. We at PDFBox use it. – Tilman Hausherr Jan 30 '15 at 08:33