I'm currently using PHP's Imagick to convert some PDFs to images. This works well except for one small detail: the images are 'chopped' during output.

This appears to be due to a mismatch between the page size reported by the PDF and the actual content dimensions.

The PDF reports itself as a 612x792, 72 ppi document, yet when I export an image from it via Preview on the Mac, the image is 1651x1275. How is this possible?

Obviously the export is correct, as the image displays correctly at those dimensions. Could it be that the PDF was simply wrongly encoded, with the width and height mixed up? How can I detect this via code? Also, the exported image is much larger, roughly twice the size, which leads me to believe some information isn't being read properly by Imagick.

Basically I'd like to know if there is a proper way to determine the actual PDF content size, so that the images exported from it are at the best quality possible.

Thanks!

EDIT: (code added)

<?php
$im = new Imagick();
$im->readImage("SomeTest.pdf");
$im->setImageColorspace(255);
$im->setCompression(Imagick::COMPRESSION_JPEG);
$im->setCompressionQuality(60);
$im->setImageFormat('jpeg');
$im->writeImages("SampleImage.jpg");
?>

The pdf used is the following: http://www.pantone.com/pages/MYP_mypantone/software_downloader.aspx?f=3

Also, here is the output of imagick from the identifyImage() function, which seems a bit wrong looking at the file size.

Array
(
    [imageName] => /tmp/magick-XXehkI8e
    [format] => PDF (Portable Document Format)
    [geometry] => Array
        (
            [width] => 612
            [height] => 792
        )

    [type] => TrueColor
    [colorSpace] => RGB
    [resolution] => Array
        (
            [x] => 72
            [y] => 72
        )

    [units] => Undefined
    [fileSize] => 50mb
    [compression] => Undefined
    [signature] => 9426f3fc4f45afd71941435a37d585d01e01d32458f3ca241e72892c2f7f35d5
)
hakre
TeckniX
  • Everything seems fine until you get to the file size. That's Really Sketchy. – Mark Storer May 20 '11 at 17:06
  • Whenever you are converting PDFs to images with image magick, be sure to set the `-density` parameter to the correct DPI, otherwise the quality and size will be dire. – Orbling May 20 '11 at 18:21
  • Mark, the image size actually doesn't work - There's an obvious array of images being created within imagick that I need to figure out, so that I can set the size on each image prior to writing them out. – TeckniX May 20 '11 at 20:14
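The `-density` advice in the comments above maps to `Imagick::setResolution()` in PHP, and it has to be called before the PDF is read. The following is only a minimal sketch of that idea: the helper name `renderPdfToJpegs`, the output naming pattern, and the 150 dpi default are assumptions for illustration, not part of the original code.

```php
<?php
/**
 * Hypothetical helper: rasterize every page of a PDF to its own JPEG.
 * The function name, file naming scheme and default density are assumptions.
 */
function renderPdfToJpegs(string $pdfPath, string $outPrefix, int $dpi = 150): array
{
    $im = new Imagick();

    // Density must be set BEFORE readImage(); it controls the resolution
    // at which the PDF pages are rasterized (the -density equivalent).
    $im->setResolution($dpi, $dpi);
    $im->readImage($pdfPath);

    $written = [];
    // An Imagick object holding a multi-page PDF can be iterated page by page.
    foreach ($im as $index => $page) {
        $page->setImageFormat('jpeg');
        $page->setImageCompression(Imagick::COMPRESSION_JPEG);
        $page->setImageCompressionQuality(60);

        $file = sprintf('%s-%03d.jpg', $outPrefix, $index);
        $page->writeImage($file);
        $written[] = $file;
    }

    $im->clear();
    return $written;
}
```

Something like `renderPdfToJpegs('SomeTest.pdf', 'SampleImage', 150)` would then write one file per page; whether 150 dpi is the right density depends on the content of the PDF.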

2 Answers

You should be aware that PDF on its own is a resolution-free format. Pages are described mathematically, in a way that isn't tied to any particular resolution limit except for those imposed by floating point numbers.

PDF only truly has resolution when it is rendered to a particular device (and that may or may not be at the device's resolution).

"But what about images? Images in PDFs surely give it resolution!" Sort of. Images in PDF are represented as unit-free samples and do not themselves have resolution until they are have been instantiated on a page. I can take a 300 dpi 8.5"x11" 1-bit image and embed it into a PDF, but that same image can be put into the content stream of a page in a space that fills an entire 8.5"x11" space, thus maintaining the resolution or it could be rendered into a much smaller thumbnail (creating a higher resolution through the scale) - and even those "resolutions" don't apply until the page is actually rendered to a device. In addition, PDF renderers are not prevented from doing bilinear (or some other) interpolation to increase the apparent resolution of an image.

To give you a much more concrete example, if I render a PDF page on a 96 dpi monitor at 100%, the resolution of that page is no greater than 96 dpi. If I render that PDF page on an 1800 dpi phototypesetter, the resolution of the page is no greater than 1800 dpi.

If I render a 300 dpi image at 100% on a PDF page rendered at 100% on a 96 dpi monitor, the resolution of the image on the page is 96 dpi. If I render a 300 dpi image at 100% on a PDF page rendered at 100% on an 1800 dpi phototypesetter, the resolution of the image on the page is 300 dpi.

The output you are seeing from ImageMagick probably reflects that an 8.5" x 11" page in PDF units is 612 x 792, where 1 PDF unit is equivalent to 1/72 of an inch. The Preview export is consistent with a rendering at about 150 dpi if the 1651-pixel side corresponds to the 11" edge of the page (1651 / 11 ≈ 150 and 1275 / 8.5 = 150).
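The relationship between PDF units and pixels can be checked with a little arithmetic. This is only a sketch with hypothetical helper names (`pointsToPixels`, `impliedDpi`); the single fact it relies on is that one PDF unit is 1/72 of an inch.

```php
<?php
// Hypothetical helpers for illustration; 1 PDF unit (point) = 1/72 inch.

/** Pixel length of a span of $points PDF units rendered at $dpi. */
function pointsToPixels(float $points, float $dpi): int
{
    return (int) round($points * $dpi / 72.0);
}

/** Density implied when $pixels were produced from $points PDF units. */
function impliedDpi(int $pixels, float $points): float
{
    return $pixels * 72.0 / $points;
}

// A 612 x 792 page rendered at 72 dpi is 612 x 792 pixels,
// which is what identifyImage() reports by default.
$defaultWidth = pointsToPixels(612, 72);

// Which page edge a pixel dimension is paired with matters:
$dpiLongEdge  = impliedDpi(1651, 792); // 1651 px along the 11" edge
$dpiShortEdge = impliedDpi(1651, 612); // 1651 px along the 8.5" edge
```

Pairing 1651 px with the 792-unit edge gives about 150 dpi, while pairing it with the 612-unit edge gives about 194 dpi, so it is worth knowing whether the page is rotated before drawing conclusions from the exported size.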

plinth
  • plinth, thank you for this great explanation of the different rendering, as I wasn't aware of the mathematical rendering behind the PDF. What is the correct mathematical formula to apply in order to determine the correct dpi/quality of the JPEG rendering based on the PDF information provided? In this case, an 8.5"x11" page with a 300 x/y resolution? – TeckniX May 20 '11 at 20:13
  • The answer is that there isn't really an answer. *if* a page is a single image, you would have to extract the image from that page (or at least its dimensions) then push (0, 0) and (w,h) through the transformation matrix that goes from image space ( (0,0) -> (1, 1) ) to PDF space to figure out the "optimal" PDF rendering resolution. In other words, straightforward if you have all that information. Getting that information is decidedly non-trivial. – plinth May 20 '11 at 20:38
  • That's precisely the issue I'm running into right now: getting all of the information from the existing PDF (rotation, dimensions, etc.) and being able to create the correct output dimensions so that the images display in their proper resolution and rotation. Glad I'm not the only one struggling with some of these issues :) – TeckniX May 20 '11 at 21:24
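The matrix step described in the comments above can be sketched as follows, assuming the image's pixel dimensions and the matrix that places it on the page have already been extracted (which, as noted, is the hard part). The function name `effectiveImageDpi` and the `[a, b, c, d, e, f]` array layout are illustrative assumptions; the matrix maps the unit square of image space into PDF units.

```php
<?php
/**
 * Hypothetical helper: effective resolution of an image placed on a PDF page.
 *
 * $matrix is [a, b, c, d, e, f], mapping image space (the unit square)
 * to PDF units: x' = a*x + c*y + e, y' = b*x + d*y + f.
 * Returns [horizontalDpi, verticalDpi] measured along the image's own axes.
 */
function effectiveImageDpi(int $pixelWidth, int $pixelHeight, array $matrix): array
{
    [$a, $b, $c, $d] = $matrix;

    // Lengths (in PDF units) of the image's width and height edges once
    // placed on the page; hypot() keeps this correct for rotated images.
    $widthPts  = hypot($a, $b);
    $heightPts = hypot($c, $d);

    // 72 PDF units per inch, so inches = units / 72.
    return [
        $pixelWidth  * 72.0 / $widthPts,
        $pixelHeight * 72.0 / $heightPts,
    ];
}

// A 2550 x 3300 sample image filling a 612 x 792 page: 300 dpi both ways.
$full = effectiveImageDpi(2550, 3300, [612, 0, 0, 792, 0, 0]);

// The same image placed at half size: the effective resolution doubles.
$half = effectiveImageDpi(2550, 3300, [306, 0, 0, 396, 0, 0]);
```

Rendering the page at the larger of the two returned values would avoid throwing away image samples, under the assumption that the page really is dominated by that one image.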
The image within the PDF was scaled down to some size within the PDF (or it would be cropped when you look at it in Reader et al).

ImageMagick (which I ass-u-me imagick uses) uses GhostScript to convert PDFs to images. GhostScript is Quite Good at rendering PDF files. I have to wonder if you're passing some bad info along.

Can we see some code? Links to your input PDF and output image[s] would be nice too.


I just ran gs 8.71 on your PDF, and it rendered fine. What version of GhostScript are you using?

Mark Storer
  • Thanks Mark for the comment. The code is quite simple actually, and no dimensions are set, thus the PDF dimensions are being used. I'll edit my original post to add some code. – TeckniX May 20 '11 at 13:25
  • Looks like $im->getImageGeometry() will return the image size within the PDF - For some reason the pdf is in landscape and the size returned is in portrait? – TeckniX May 20 '11 at 16:53
  • The pages are rotated -90 degrees. That's a relatively rare way to do landscape, but perfectly legal. Other (more common) options are +90, and 11x8.5. – Mark Storer May 20 '11 at 17:08
  • Acrobat Pro SaveAs'ed the pages just fine... so if there's a problem, it's one that adobe software can handle (often the case... there's a lot of not-quite-valid PDF out there that adobe handles anyway). – Mark Storer May 20 '11 at 17:13
  • Having seen piles of errors in PDFs generated by GhostScript, I need to disagree with the "Quite Good" assessment. – plinth May 20 '11 at 17:34
  • Ah, but this is going the other way. PDF->Image "Quite Good **at rendering PDF files**". – Mark Storer May 20 '11 at 17:39
  • "Acrobat Pro SaveAs'ed the pages just fine..." AS JPEG. Important detail there. – Mark Storer May 20 '11 at 17:43
  • Mark, How were you able to figure out the angle/rotation of the PDF? I appreciate all the feedback and help. – TeckniX May 20 '11 at 20:12
  • You could figure that out with a text editor and a bare-bones understanding of PDF syntax. Open up your PDF in a text editor and search for "/Page". – Mark Storer May 23 '11 at 05:22
  • Mark, I'm currently using gs --version 8.70 – TeckniX May 23 '11 at 14:16
  • For the rotation angle, I was trying to see if there was an automated way to figure this out, but it could also be the version of imagick I'm using. I did see what you were talking about regarding the text version: `CropBox[0 0 612 792]/Parent 210 0 R/Contents 3 0 R/Rotate -90/MediaBox[0 0 612 792]/Resources 2 0 R/Type/Page` is one of the first lines, showing that it is rotated and that every /Page after that is a new PDF page. I'll try to create something a bit custom then. – TeckniX May 23 '11 at 14:24
  • It's a little more involved than that. For Example: The order in which page dictionaries appear in the PDF need not be their order in the document. I suggest you read the [PDF Spec](http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf) a bit, chapter 7 (Syntax), sections 3 (Objects) and 7 (document structure) in particular. That should give you enough information to make the correct searches. And there are quite a few PDF libraries out there that can do all this for you. I prefer [iText](http://itextpdf.com/), but there are lots of options floating around out there. – Mark Storer May 23 '11 at 16:22
  • The magic to all of this was to upgrade ImageMagick - As sad as it is, the ImageMagick that is currently available through the yum repository is the 6.2 version - I upgraded the version to try and get more information on the rotation of images (php function of imagick available in 6.4+) - Updated to ImageMagick 6.6.9 and this seems to have resolved a lot of my issues. Thanks for all the help and knowledge shared! – TeckniX May 23 '11 at 18:17
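For anyone who still wants to inspect the rotation by hand, the page dictionary quoted in the comments above can be read with a couple of regular expressions. This is a rough sketch only: `parsePageDict` and `displayedSize` are hypothetical helpers, and the approach assumes an uncompressed page dictionary with its own `/MediaBox`. It ignores attributes inherited from parent nodes, `/CropBox`, and object streams, so a real PDF library is the safer route.

```php
<?php
/**
 * Hypothetical helper: pull /MediaBox and /Rotate out of the raw text of
 * one uncompressed page dictionary. Returns null when no /MediaBox is found.
 */
function parsePageDict(string $dict): ?array
{
    $num = '(-?[\d.]+)';
    $pattern = '/\/MediaBox\s*\[\s*' . $num . '\s+' . $num . '\s+' . $num . '\s+' . $num . '\s*\]/';
    if (!preg_match($pattern, $dict, $box)) {
        return null;
    }

    // /Rotate is optional; a missing entry is treated as 0 here.
    $rotate = 0;
    if (preg_match('/\/Rotate\s+(-?\d+)/', $dict, $rot)) {
        $rotate = (int) $rot[1];
    }

    return [
        'width'  => (float) $box[3] - (float) $box[1],
        'height' => (float) $box[4] - (float) $box[2],
        'rotate' => $rotate,
    ];
}

/** Page size as displayed: width and height swap for 90 or 270 degree turns. */
function displayedSize(float $width, float $height, int $rotate): array
{
    $normalized = (($rotate % 360) + 360) % 360; // map -90 to 270, etc.
    return ($normalized % 180 === 0) ? [$width, $height] : [$height, $width];
}

$page = parsePageDict('CropBox[0 0 612 792]/Parent 210 0 R/Contents 3 0 R/Rotate -90/MediaBox[0 0 612 792]/Resources 2 0 R/Type/Page');
$size = displayedSize($page['width'], $page['height'], $page['rotate']);
```

For the dictionary quoted in the thread, `$size` comes out as 792 x 612, i.e. a landscape page, which matches the orientation of the Preview export.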