
I have about 50-60 pdf files (images) that are 1.5MB large each. Now I don't want to have such large pdf files in my thesis as that would make downloading, reading and printing a pain in the rear. So I tried using ghostscript to do the following:

gs \
  -dNOPAUSE -dBATCH \
  -sDEVICE=pdfwrite \
  -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS="/screen" \
  -sOutputFile=output.pdf \
    L_2lambda_max_1wl_E0_1_zg.pdf

However, now my 1.4MB pdf is 1.5MB large.

What did I do wrong? Is there some way I can check the resolution of the PDF file? I just need 300 dpi images. Would anyone suggest using convert to change the resolution, or is there some way I could reduce the image resolution with gs? The image is very grainy when I use convert.

How I use convert:

 convert \
     -units PixelsPerInch \
      ~/Desktop/L_2lambda_max_1wl_E0_1_zg.pdf \
     -density 600 \
      ~/Desktop/output.pdf

Example File

http://dl.dropbox.com/u/13223318/L_2lambda_max_1wl_E0_1_zg.pdf

Kurt Pfeifle
dearN
  • Do you have images in the PDF file? Are they color or grayscale? – Kurt Pfeifle Aug 07 '12 at 17:24
  • @KurtPfeifle gs 9.04. The pdfs are images in color. I don't know if the fonts are embedded. I save figures from mathematica files as pdfs. They have numbers on the axes which are the only fonts – dearN Aug 07 '12 at 17:31
  • Run `pdffonts your.pdf` to get a list of fonts. If the `emb` column says `yes` the font is embedded... – Kurt Pfeifle Aug 07 '12 at 17:33
  • If you have Helvetica, Courier, Times then you don't necessarily need embedding. If you have *full* embedding of some fonts (column `sub` says `no`), you can switch to embed the used subset only. – Kurt Pfeifle Aug 07 '12 at 17:37
  • @KurtPfeifle `NO` for `emb`. So the fonts are NOT embedded. The font is Times new roman. – dearN Aug 07 '12 at 17:38
  • Did I get this right, all your 50-60 files are in fact 1-page images embedded into PDF? – Kurt Pfeifle Aug 07 '12 at 17:53
  • @KurtPfeifle [Example file here](http://dl.dropbox.com/u/13223318/L_2lambda_max_1wl_E0_1_zg.pdf) – dearN Aug 07 '12 at 18:04
  • higher density perhaps? – rogerdpack Feb 20 '14 at 00:30

2 Answers


Running Ghostscript with -dPDFSETTINGS=/screen is just a sort of shortcut: it implicitly applies a whole bunch of individual settings, which you can query with the following command:

gs \
  -dNODISPLAY \
  -c ".distillersettings {exch ==only ( ) print ===} forall quit" \
| grep '/screen'

On my Ghostscript (v9.06prerelease) I get the following output (slightly edited to increase readability):

/screen 
  << /DoThumbnails false 
     /MonoImageResolution 300 
     /ColorImageDownsampleType /Average 
     /PreserveEPSInfo false 
     /ColorConversionStrategy /sRGB 
     /GrayImageDownsampleType /Average 
     /EmbedAllFonts true 
     /CannotEmbedFontPolicy /Warning 
     /PreserveOPIComments false 
     /GrayImageResolution 72 
     /GrayACSImageDict << 
                        /ColorTransform 1 
                        /QFactor 0.76 
                        /Blend 1 
                        /HSamples [2 1 1 2] 
                        /VSamples [2 1 1 2] 
                      >> 
     /ColorImageResolution 72 
     /PreserveOverprintSettings false 
     /CreateJobTicket false 
     /AutoRotatePages /PageByPage 
     /MonoImageDownsampleType /Average 
     /NeverEmbed [/Courier 
                  /Courier-Bold 
                  /Courier-Oblique 
                  /Courier-BoldOblique 
                  /Helvetica 
                  /Helvetica-Bold 
                  /Helvetica-Oblique 
                  /Helvetica-BoldOblique 
                  /Times-Roman 
                  /Times-Bold 
                  /Times-Italic 
                  /Times-BoldItalic 
                  /Symbol 
                  /ZapfDingbats] 
     /ColorACSImageDict << 
                          /ColorTransform 1 
                          /QFactor 0.76 
                          /Blend 1 
                          /HSamples [2 1 1 2] 
                          /VSamples [2 1 1 2] >> 
     /CompatibilityLevel 1.3 
     /UCRandBGInfo /Remove 
>>
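For PDFs that do contain raster images, the individual values in this dictionary can be set on the command line instead of relying on a preset. The following is only a sketch: `downsample_pdf` and the `DRY_RUN` switch are hypothetical names introduced for illustration, and the 300 dpi default is an assumption taken from the question. The `-d` options are pdfwrite distiller parameters of the same family as the ones listed above.

```shell
#!/bin/bash
# Sketch: re-distill a PDF with an explicit image resolution instead of a preset.
# downsample_pdf and DRY_RUN are hypothetical names used for illustration only.
#   $1 = input PDF, $2 = output PDF, $3 = target resolution in dpi (default 300)
downsample_pdf() {
    local in="$1" out="$2" dpi="${3:-300}"
    # Build the Ghostscript command line as positional parameters.
    set -- gs -o "$out" -sDEVICE=pdfwrite \
        -dDownsampleColorImages=true \
        -dColorImageDownsampleType=/Bicubic \
        -dColorImageResolution="$dpi" \
        -dDownsampleGrayImages=true \
        -dGrayImageDownsampleType=/Bicubic \
        -dGrayImageResolution="$dpi" \
        "$in"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"      # only print the command
    else
        "$@"           # run Ghostscript
    fi
}
```

Calling `DRY_RUN=1 downsample_pdf in.pdf out.pdf 300` prints the command without running it. These parameters only touch raster images, so they cannot shrink a PDF that consists purely of vector drawing.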

I'm wondering if your PDFs are image-heavy, and if this sort of conversion does unwelcome things (e.g. re-sampling images with the 'wrong' parameters) which increase the file size...

If this is the case (image-heavy PDF), tell so, and I'll update this answer with a few suggestions....


Update

I had a look at the sample file provided by DNA. Interesting...

No, it does not contain any image.

Instead, it contains one large stream (compressed using /FlateDecode) which consists of more than 700,000 (!!) operations, mostly single vector operations in PDF language, such as:
m (moveto),
l (lineto),
d (setdash),
w (setlinewidth),
S (stroke),
s (closepath and stroke),
W* (eoclip),
rg and RG (setrgbcolor)
and a few more.

That PDF code is very inefficiently written AFAICS (but it does its job): it concatenates many short strokes instead of drawing 'long' ones, nearly every stroke sets the color again (even if it is the same as before), and each one carries all the other overhead (start stroke, end stroke, ...).
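A rough way to reproduce such an operator count is to tally the operator tokens in an uncompressed content stream. The sketch below assumes the stream is already available as plain text on stdin; `count_pdf_ops` is a hypothetical helper name, and the operator list is just the one quoted above.

```shell
#!/bin/bash
# Sketch: tally PDF drawing operators found in text read from stdin.
# count_pdf_ops is a hypothetical helper name, not part of any tool.
# Every whitespace-separated token that equals one of the listed
# operators is counted, so the result is an approximation.
count_pdf_ops() {
    LC_ALL=C awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(m|l|d|w|S|s|W\*|rg|RG)$/) count[$i]++
        }
        END { for (op in count) print op, count[op] }
    ' | LC_ALL=C sort
}
```

One possible way to feed it, assuming qpdf is installed, is `qpdf --qdf --object-streams=disable input.pdf - | count_pdf_ops`, since QDF mode writes the streams uncompressed. Tokens outside the page content stream that happen to match are counted too, so treat the numbers as an estimate.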

Ghostscript's -dPDFSETTINGS=/screen does not have any effect here (there are no images to downsample, for example). The increased file size (+48 kByte to be precise) is probably due to Ghostscript re-organizing some of the internal stroking etc. commands into a different order when it interprets the file.

So there is not much you can do about the PDF file size ...

  • ...unless you convert each of these PDF pages into a REAL image such as PNG:
    gs \
      -o out72.png \
      -sDEVICE=pngalpha \
       L_2lambda_max_1wl_E0_1_zg.pdf

(I used the pngalpha output device to get a transparent background.) The image dimensions of 'out72.png' are 259x213px, and the filesize is now 70 kByte. But I'm sure you'll not like the quality :-)

The output quality is 'bad' because Ghostscript uses a default resolution of 72 dpi.

Since you said you'd like to have 300dpi, the command becomes this:

gs \
  -o out300.png \
  -sDEVICE=pngalpha \
  -r300 \
   L_2lambda_max_1wl_E0_1_zg.pdf

The filesize now is 750 kByte, the image dimensions are 1080x889 Pixels.
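To check the pixel dimensions of a generated PNG without opening an image viewer, the width and height can be read straight from the file header, and dividing by the rendering resolution gives the physical size. This is a minimal sketch: `png_size` and `png_inches` are hypothetical helper names, and it assumes a well-formed PNG whose IHDR chunk comes first (as the PNG format requires).

```shell
#!/bin/bash
# Sketch: read the pixel dimensions of a PNG from its IHDR chunk.
# png_size and png_inches are hypothetical helper names.

# Print "WIDTH HEIGHT". Bytes 16-19 hold the width and bytes 20-23
# the height, both as big-endian unsigned integers.
png_size() {
    # od prints the 8 bytes as decimals; set -- splits them into $1..$8
    set -- $(od -An -tu1 -j16 -N8 "$1")
    echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 )) $(( ($5 << 24) | ($6 << 16) | ($7 << 8) | $8 ))"
}

# Print the physical size for a PNG that was rendered at a given dpi.
#   $1 = PNG file, $2 = resolution used for rendering (e.g. 300)
png_inches() {
    png_size "$1" | awk -v dpi="$2" '{ printf "%.2f x %.2f in\n", $1 / dpi, $2 / dpi }'
}
```

For example, `png_inches out300.png 300` reports the size the figure would have when placed unscaled. If the figure is scaled up in the final document, its effective resolution drops below the rendering resolution accordingly.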


Update 2

Since Curiosity is en vogue these days... :-) ...I tried to bring down the file size with the help of Adobe Acrobat X Pro on Mac.

You wanna know the results?

Performing a 'Save as... (PDF with reduced filesize)' -- which in the past always yielded very good results for me! -- created a 1.8+ MByte file (+29%). I guess this definitely puts Ghostscript's performance (file size increase +3%) into a realistic perspective!

Kurt Pfeifle
  • If by image heavy you mean, too many images, yes. The pdf file that I have is actually just ONE Image. All my images are pdfs – dearN Aug 07 '12 at 19:15
  • Kurt, Thanks for the particularly well laid out answer. I'll try this soon and describe the result here. – dearN Aug 07 '12 at 21:27
  • What is the `13223318` number appended to the pdf file name? Is that just the folder path? – dearN Aug 07 '12 at 21:29
  • @DNA: number just slipped through my cut'n'paste check... I'll fix it. – Kurt Pfeifle Aug 07 '12 at 21:38
  • Thanks Kurt! That solves "the mystery of the vagrant number"! – dearN Aug 07 '12 at 23:09
  • @DNA: did you look at Update 2 of my answer? – Kurt Pfeifle Aug 07 '12 at 23:19
  • thanks. I did look at the second update. I have had this issue before but I think your explanation here does er... explain a lot as to why my final file sizes are much larger than the initial file sizes. Thanks! I am writing a bash script file to sniff out pdf files in my directory and convert them to lower sized pngs which are black and white as well! – dearN Aug 08 '12 at 16:45
  • Hi Kurt, I created a bash file that converts pdfs to gray scale pngs and then creates a partial latex document. I have included you as one of the authors. **File can be found [here](http://dl.dropbox.com/u/13223318/texer.sh)**. Please do let me know if that's alright and satisfactory. – dearN Aug 08 '12 at 17:35
  • Do you mean via the code that you just included as the second update? No. The final file sizes of the png files are about half the pdf files. – dearN Aug 08 '12 at 17:36
  • @DNA: my final PDF (`/screen`) filesizes using Ghostscript are NOT 'much' larger than your input filesizes. It's only about +3%... Is yours more?!? – Kurt Pfeifle Aug 08 '12 at 17:51
  • Oh... When I start with my 1.5MB file and use `/screen` I end up with a 1.6MB file... – dearN Aug 08 '12 at 17:58
  • @DNA: I don't insist on being one of the authors in your bash script, but I don't mind either :-) Just spell-correct my name (2nd occurrence) if it stays. -- Also, I got half the file size even with *colored* PNG! You don't need to convert to grayscale PDF first if you want 'gray' PNG. Last, your 'gray' PNG will be (again) in RGB color space anyway, and hence has the same filesize as real color ones!! – Kurt Pfeifle Aug 08 '12 at 19:29
  • @DNA: Rounding error with 1.6MB. I had a *close* look and saw ~50.000 Bytes increase.... – Kurt Pfeifle Aug 08 '12 at 19:32
  • I just analyzed the pdf that I got from inserting all the png images via latex and ran a preflight test on it. It says that all images are less than 299 ppi. So is there something I can do about this? I changed `-r300` to `-r400` and that did fix the problem. However, the pdf file made from the 17 converted images is a whopping 12 MB. And I have at least 50 more images like that! Is there any way I could further "compress" my `png` images without losing resolution? – dearN Aug 08 '12 at 20:06
  • @DNA: Less than 299dpi? My guess is that it's not *much* lower... Or your latex has run some (unwanted) downsampling when converting to PDF. Which dpi was it exactly?!? Anyway, see my new answer... – Kurt Pfeifle Aug 08 '12 at 20:10
  • Yes, with the `-r300` flag I had less than 299ppi show up as my preflight report. I did check your new answer out. Will try it now and post the results. – dearN Aug 08 '12 at 20:11
  • @DNA: You using Acrobat Pro for your preflight report? – Kurt Pfeifle Aug 08 '12 at 20:12
  • Yes I am using acrobat pro. The preflight profile is [this](http://dl.dropbox.com/u/13223318/Graduate-School-images.kfp) – dearN Aug 08 '12 at 20:15

DNA decided to go for grayscale PNGs. The way he's creating them is in two steps:

  1. Step 1: Convert a color PDF page (such as this) to a grayscale PDF page, using Ghostscript's pdfwrite device and the settings
    -dColorConversionStrategy=/Gray and
    -dProcessColorModel=/DeviceGray.
  2. Step 2: Convert the grayscale PDF page to a PNG, using Ghostscript's pngalpha device at a resolution of 300 dpi (-r300 on the GS commandline).

This reduces his initial file size of 1.4 MB to 0.7 MB.

But this workflow has the following disadvantage:

  • It loses all color info, without saving much disk space compared to a color output written at the same resolution, directly from the PDF!

There are 2 alternatives to DNA's workflow:

  1. A one-step conversion of (color) PDF -> (color) PNG, using Ghostscript's pngalpha device with the original PDF as input (same settings of 300 dpi resolution). This would have this advantage:

    • It would keep the color information in the PNG output, requiring only a little more space on disk!
  2. A one-step conversion of (color) PDF -> grayscale PNG, using Ghostscript's pnggray device with the original PDF as input (same settings of 300 dpi resolution), with this mix of advantage/disadvantage :

    • It would lose the color information in the PNG output.
    • It would lose the transparent background that was preserved in DNA's workflow.
    • It would save lots of disk space, because the filesize would go down to about 20% of the output from DNA's workflow.
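To verify which of these variants really ends up as a grayscale file, the color type byte in the PNG header can be inspected. The sketch below is an illustration under the assumption of a well-formed PNG; `png_color_type` is a hypothetical helper name. A file written by pngalpha would be expected to report RGB with alpha, and one written by pnggray plain grayscale.

```shell
#!/bin/bash
# Sketch: report the color type stored in a PNG's IHDR chunk.
# png_color_type is a hypothetical helper name.
# Byte 25 of a PNG file is the IHDR color type field.
png_color_type() {
    case "$(od -An -tu1 -j25 -N1 "$1" | tr -d ' \n')" in
        0) echo "grayscale" ;;
        2) echo "RGB" ;;
        3) echo "palette" ;;
        4) echo "grayscale+alpha" ;;
        6) echo "RGB+alpha" ;;
        *) echo "unknown" ;;
    esac
}
```

Running `png_color_type` on each output file shows whether a 'gray looking' PNG is in fact still stored with RGB channels, which is one plausible contributor to the size difference between the variants.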

So that you can make up your mind and see the output sizes and quality side by side, here is a shell script to demonstrate the differences:

#!/bin/bash
#
# Copyright (c) 2012 <kurt.pfeifle@gmail.com>
# License: Creative Commons (CC BY-SA 3.0) 

function echo_do() {
        echo
        echo "Command:     ${*}"
        echo "--------"
        echo
        "${@}"
}

[ -d out ] || mkdir out

echo 
echo "    We assume all PDF pages are 1-page PDFs!"
echo "    (otherwise we'd have to include something like '%03d'"
echo "    into the output filenames in order to get paged output)"
echo

# Double quotes are used here because the text contains apostrophes,
# which would terminate a single-quoted string early.
echo "
 # Convert Color PDF to Grayscale PDF.
 # (If the PDF has a transparent background (most do),
 # this will remain transparent in the output.)
 # ATTENTION: since we don't set a resolution,
 # pdfwrite will use its default value of '-r720'.
 # (However, this setting will only affect raster objects...)
"
for i in *.pdf
do
echo_do gs \
 -o "out/${i}---pdfwrite-devicegray-gs.pdf" \
 -sDEVICE=pdfwrite \
 -dColorConversionStrategy=/Gray \
 -dProcessColorModel=/DeviceGray \
 -dCompatibilityLevel=1.4 \
  "${i}"
done

echo '
 # Convert (previously generated) grayscale PDF to PNG using Alpha channel
 # (Alpha channel can make backgrounds transparent)
'
for i in out/*pdfwrite-devicegray*.pdf
do
echo_do gs \
 -o "out/$(basename "${i}")---pngalpha-from-pdfwrite-devicegray-gs.png" \
 -sDEVICE=pngalpha \
 -r300 \
  "${i}"
done

echo '
 # Convert (color) PDF to grayscale PNG using Alpha channel 
 # (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
# Following only required for 'pdfwrite' output device, not for 'pngalpha'!
#                -dProcessColorModel=/DeviceGray 
echo_do gs \
 -o "out/${i}---pngalphagray_gs.png" \
 -sDEVICE=pngalpha \
 -dColorConversionStrategy=/Gray \
 -r300 \
  "${i}"
done

echo '
 # Convert (color) PDF to (color) PNG using Alpha channel
 # (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
echo_do gs \
 -o "out/${i}---pngalphacolor_gs.png" \
 -sDEVICE=pngalpha \
 -r300 \
  "${i}"
done

echo '
 # Convert (color) PDF to grayscale PNG 
 # (no Alpha channel here, therefore [mostly] white backgrounds)
'
for i in *.pdf
do
echo_do gs \
 -o "out/${i}---pnggray_gs.png" \
 -sDEVICE=pnggray  \
 -r300 \
  "${i}"
done

echo " All output to be found in ./out/ ..."
echo

Run this script and compare the different outputs side by side.

Yes, the 'direct-grayscale-PNG-from-color-PDF-using-pnggray-device' output may look a bit worse than the others (and it doesn't sport the transparent background) -- but it is also only about 20% of the file size. On the other hand, if you want to buy a bit more quality by sacrificing a bit of disk space, you could use -r400 instead of -r300...
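To put numbers on that comparison, the sizes of the generated files can be listed relative to one reference file. This is a small sketch, not part of the script above: `size_report` is a hypothetical helper name, and it assumes the first file given is the non-empty reference.

```shell
#!/bin/bash
# Sketch: print each file's size in bytes and as a percentage of the
# first file given. size_report is a hypothetical helper name.
# The first file must not be empty (it is used as the divisor).
size_report() {
    local base size f
    base=$(( $(wc -c < "$1") ))
    for f in "$@"; do
        size=$(( $(wc -c < "$f") ))
        printf '%s %d %d%%\n' "$f" "$size" $(( size * 100 / base ))
    done
}
```

For instance, `size_report out/*pngalphacolor_gs.png out/*pnggray_gs.png` lists the pnggray output as a percentage of the color pngalpha output when exactly one PDF was converted.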

Kurt Pfeifle
  • The bash script you provided errored out with: `Error: /undefinedfilename in (col*.pdf)` How should I go about changing the file name in the script? – dearN Aug 08 '12 at 20:17
  • @DNA: Sorry. Just change all occurrences of 'col*.pdf' to '*.pdf'. I did run 1 test, found a directory which had 200 PDFs, two of which started their name with 'col', and I used these to test them... :-) Fixed in script above. – Kurt Pfeifle Aug 08 '12 at 20:44
  • Oh. Sure! `:)` I did try `pnggray` and like you said, there is an 80% reduction in file size from 1.2Mb to ~200KB. Thanks! – dearN Aug 08 '12 at 20:54