
Is there a way to achieve the same compression as the following (great compression ratio and quality, but it's slow and can break PDFs):

pdfimages -tiff "$1" pdf_images
convert pdf_images-* -alpha off -monochrome -compress Group4 -density 250 "${1%.pdf}.compressed.pdf"
rm pdf_images-*

By using only Ghostscript instead?

I tried playing around with -dPDFSETTINGS, -dGrayImageDownsampleType and -sColorConversionStrategy, but the result was usually either lower in quality or larger in size.

The PDF consists of scanned pages (one image per page).

I usually use something like the following with GS (something is still missing, because the images aren't converted. Is this by design?):

gs \
    -q \
    -dNOPAUSE \
    -dBATCH \
    -dSAFER \
    -sDEVICE=pdfwrite \
    -dPDFSETTINGS=/screen \
    -dEmbedAllFonts=false \
    -dSubsetFonts=false \
    -dGrayImageDownsampleType=/Bicubic \
    -dGrayImageResolution=250 \
    -dMonoImageDownsampleType=/Bicubic \
    -dMonoImageResolution=250 \
    -sProcessColorModel=DeviceGray \
    -dProcessColorModel=/DeviceGray \
    -sColorConversionStrategy=/Mono \
    -dOverrideICC \
    -sOutputFile=output.pdf \
    input.pdf

Random PDF Sample from Google: https://www.2ndcollege.com/colleges/gcet/btech/sem5/ic/socio/notes/unit1.pdf

Original: 5.6MB

GS: 1.4MB (not mono)

[Images: Ghostscript output; Ghostscript output, zoomed]

PDFImages + ImageMagick: 1.4MB (only images are converted)

[Images: ImageMagick output; ImageMagick output, zoomed]

nathan
  • I'm not sure you are comparing like with like, and you haven't supplied any examples to look at. The Ghostscript pdfwrite device has a **wide** range of controls, but you're going to have to decide yourself what you need. The first thing to say is "don't use PDFSETTINGS": that sets a load of controls in one go, and almost certainly none of them will help you. The second thing is that you appear to be producing monochrome output and yet you are playing with the Gray image parameters; that's not going to work. – KenS May 16 '19 at 07:10
  • Note that if convert is rendering to a monochrome image from something which is not monochrome, then pdfwrite isn't going to help you; it doesn't change colour data into monochrome. – KenS May 16 '19 at 07:10
  • The use case is for scanned docs mostly, mono or grayscale images. I tried with both gray and mono downsampling settings but quality was rather low in comparison with my first approach. – nathan May 16 '19 at 12:41
  • To be honest, without some examples to look at, it's not really possible to comment. It's certainly the case that a dedicated image processing application is likely to produce better results when downscaling an image than a PDF processing application. Ghostscript and the pdfwrite device aren't intended for that purpose; it's best to use the correct tool for the job. If the content started as an image, then process the image before you make a PDF from it; that's always going to produce the best result. – KenS May 16 '19 at 13:07
  • And you haven't said which version of Ghostscript you are using, on which platform, or the command lines you have tried. I'm willing to look at this, but not by fumbling in the dark and trying to guess what IM has done, or what you consider to be 'better quality'. I'd suggest you put a simple file, the IM result and the GS result somewhere public and post the URL here, along with the command lines you used for each process. – KenS May 16 '19 at 13:08
  • Updated q with my gs command. Will upload sample next week. "Better quality" as in noticeable artifacts due to compression, pixelation. – nathan May 17 '19 at 22:08

1 Answer


Adding as an answer because it's too long for a comment.

The artefacts you are referring to are, I think, caused by JPEG quantisation. The original image has been decompressed, downsampled to a lower resolution, and then recompressed. Since you haven't selected any other compression method, the default for the /screen PDFSETTINGS is used, which is JPEG for Gray and colour images and CCITT Fax for mono images.

You could easily avoid that by using a different compression filter, though of course that would not produce as much compression of the output.
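As a rough illustration, the switches that select a lossless filter instead of JPEG might look like the sketch below. The wrapper name `gs_flate` and the file names are hypothetical, and it assumes `gs` is on the PATH. The AutoFilter switches are set to false so that the explicit filter choice is the one used.

```shell
# Hypothetical wrapper: rewrite a PDF using Flate (lossless) compression for
# gray and colour images instead of the JPEG (DCTEncode) default.
# Usage: gs_flate input.pdf output.pdf
gs_flate() {
    gs -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dAutoFilterGrayImages=false \
        -dGrayImageFilter=/FlateEncode \
        -dAutoFilterColorImages=false \
        -dColorImageFilter=/FlateEncode \
        -sOutputFile="$2" \
        "$1"
}
```

Expect the output to be larger than with JPEG; the gain is the absence of quantisation artefacts.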

There are several suggestions I can make. Firstly, don't use PDFSETTINGS unless you are completely sure you want all the things it is doing. In general I would expect better results from leaving most settings alone and simply applying the switches you need.

Given that these are scanned pages, there is no point in setting any of the Font related parameters (unless invisible fonts have been added).

You've set ProcessColorModel twice, once as a name and once as a string. In fact, if you use ColorConversionStrategy, you shouldn't set it at all, and if you aren't using ColorConversionStrategy then it won't have any effect, so you can just drop these two entirely.
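Putting those suggestions together, a trimmed-down version of the command from the question could look something like this. It is only a sketch: the wrapper name `gs_gray`, the file names and the 250 dpi default are assumptions for illustration, and the image filter is left at its default. The ColorConversionStrategy value is written as Gray, which is the intended setting; on the release affected by the bug described in this answer, a different value had to be requested to get gray output.

```shell
# Hypothetical wrapper following the advice above: no PDFSETTINGS, no font
# switches, no ProcessColorModel.
# Usage: gs_gray input.pdf output.pdf [dpi]
gs_gray() {
    dpi="${3:-250}"   # target resolution for gray images (assumed default)
    gs -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -sColorConversionStrategy=Gray \
        -dDownsampleGrayImages=true \
        -dGrayImageDownsampleType=/Bicubic \
        -dGrayImageResolution="$dpi" \
        -sOutputFile="$2" \
        "$1"
}
```

For example, `gs_gray input.pdf output.pdf 150` would request a lower resolution, trading quality for size.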

There is no ColorConversionStrategy of /Mono, and trying to set that causes errors for me. There appears to have been a bug introduced with the ColorConversionStrategy in the current release. If you set Gray you will actually get RGB. In order to get Gray you actually need to request CMYK. Obviously that's been fixed, but in the meantime all the spaces are 'off by one': sRGB->CMYK, CMYK->Gray and Gray->RGB. LeaveColorUnchanged is unaffected.

Of course this means that your setting of the Gray and Mono Image parameters is having no effect (at least not on the colour images anyway). This is why you get a low output size, and also why the result is heavily downsampled and quantised.

Now, as I've already said, you can't get Ghostscript's pdfwrite to produce monochrome output, only grayscale. Reducing the image data by a factor of between 8 and 24 is where most of the gains are coming from, I believe. So frankly there's no way you are going to get down to the same output size using pdfwrite without heavily downsampling the images. And if you do that, then the quality is going to suffer.
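The factors of 8 and 24 are simply the bits per pixel of 8-bit gray and 24-bit RGB compared with 1-bit monochrome. A small sketch of the arithmetic, using an assumed US Letter page at 250 dpi (the page size is only an example, and these are raw sizes before any compression filter is applied):

```shell
# Raw (uncompressed) size in bytes of one page image.
# Usage: raw_bytes width_px height_px bits_per_pixel
raw_bytes() {
    echo $(( $1 * $2 * $3 / 8 ))
}

# US Letter (8.5 x 11 in) at 250 dpi is 2125 x 2750 pixels (assumed example).
mono=$(raw_bytes 2125 2750 1)    # 1-bit monochrome
gray=$(raw_bytes 2125 2750 8)    # 8-bit grayscale
rgb=$(raw_bytes 2125 2750 24)    # 24-bit RGB

echo "gray/mono: $(( gray / mono ))"
echo "rgb/mono:  $(( rgb / mono ))"
```

This prints ratios of 8 and 24, which is the head start the monochrome conversion has before Group 4 compression is even applied.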

This command line:

\ghostpdl\debugbin\gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -sColorConversionStrategy=CMYK -dPDFSETTINGS=/screen -dGrayImageDownsampleType=/Bicubic -dGrayImageFilter=/FlateEncode -dAutoFilterGrayImages=false unit1.pdf

produces a gray output file 2.1 MB in size, but the extreme downsampling has resulted in very blurry output; I don't think you will like it at all. You could change the amount of downsampling, but that of course will result in a larger output file. You could leave the compression filter unchanged (DCTEncode == JPEG), but that will get you compression artefacts.

Basically, as I said right at the beginning, if you want to manipulate image data, the best way to do it is with a tool designed to manipulate images, not one designed to render PostScript/PDF files.

You could, with some effort, render the original pages to a bitmap format with Ghostscript, using a stochastic screening method as IM appears to have used, then read the images back into Ghostscript to produce a PDF file, but that hardly seems like it's easier than using IM as you are now.
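To give an idea of the first half of that route, the rendering step might be sketched as follows. The wrapper name `render_g4` and the file names are hypothetical, and it assumes `gs` is on the PATH. Note the limits of the sketch: it relies on whatever halftoning the `tiffg4` device applies by default rather than a stochastic screen, and turning the resulting TIFF back into a PDF is not shown.

```shell
# Hypothetical wrapper: render every page of a PDF to a 1-bit, CCITT Group 4
# compressed TIFF at the requested resolution.
# Usage: render_g4 input.pdf pages.tif [dpi]
render_g4() {
    dpi="${3:-250}"   # rendering resolution (assumed default)
    # -o sets the output file and also implies -dBATCH -dNOPAUSE
    gs -q -dSAFER \
        -sDEVICE=tiffg4 \
        -r"$dpi" \
        -o "$2" \
        "$1"
}
```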

KenS