
I have an old Kindle Dx. Owing to disabilities, I can't use tablets or other touch devices, and I transfer pdfs to the Kindle to read them. It requires pre-processing.

What is a good option to pre-process pdfs without rasterizing them?

[When rasterizing is acceptable:

  • k2pdfopt -mode copy for maps or for small text. This rasterizes, enhances contrast, and makes everything 1.4-compatible.

  • k2pdfopt -mode copy -dev dx for other works. This rasterizes to 800x1080, downsamples as needed, enhances contrast while making everything grayscale, and makes everything 1.4-compatible.

When rasterizing text is not acceptable:

  • gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want to preserve graphics. This makes minimal changes to make everything 1.4-compatible.

  • gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -g800x1080 -r150 -dPDFFitPage -dFastWebView -sColorConversionStrategy=RGB -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorImageResolution=150 -dGrayImageResolution=150 -dMonoImageResolution=300 -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want moderate downsampling. This downsamples existing raster images to fit 800x1080 and makes everything 1.4-compatible.

  • gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -g800x1080 -r150 -dPDFFitPage -dFastWebView -sColorConversionStrategy=Gray -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want more aggressive downsampling. This downsamples raster images to fit 400x540, makes them grayscale, and makes everything 1.4-compatible. Low image quality, but usually still recognizable.

  • gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dFILTERIMAGE -dFILTERVECTOR -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want to cut all graphics.

If using any of these options to pre-process for another device check its screen size in pixels. Don't worry too much about pixels per inch.]

[I.S. My goals are to fix pdfs so they 1. don't crash my Kindle, 2. don't freeze my Kindle or take too long to load each page, and 3. don't take up too much of the limited disk space on my Kindle. Preferably also 4. not rasterizing text, 5. not cutting out all images, which can sometimes lose tables, etc. and 6. not reflowing text, which will generally lose tables. But I'm happy to downsample most images.]

[I.S. Note that I'm keeping copies of the originals. This is not a way to save disk space!]

For scanned pdfs, Willus's k2pdfopt is a great option. I've set up Mac Automator for

k2pdfopt -mode copy -dev dx

or occasionally just -mode copy.

For pdf-born-pdfs, I'd rather not rasterize everything.

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH

can usually convert files, so the Kindle Dx can open them, but the Kindle will still slow, freeze, or crash with some pages.

One option is to combine Ghostscript and Mutool as follows:

  1. gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH to pre-process pdfs to remove passwords,
  2. mutool clean -g -g -d -s -l to sort out the junk, and then
  3. gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH again to get a smaller and faster pdf.
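
The three steps above could be scripted roughly as follows. This is only a sketch: the flags are copied from the list, while the function name, the intermediate file names, and the -sOutputFile argument (borrowed from the full commands earlier in this post) are assumptions for illustration. It builds the command lines without running them.

```python
# Flags shared by both Ghostscript passes, copied from the steps above.
GS_BASE = [
    "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
    "-sstdout=%stderr", "-dNOPAUSE", "-dQUIET", "-dBATCH",
]


def build_pipeline(source, result, tmp1="step1.pdf", tmp2="step2.pdf"):
    """Return the three command lines: gs, then mutool clean, then gs again.

    tmp1 and tmp2 are hypothetical names for the intermediate files.
    """
    return [
        GS_BASE + [f"-sOutputFile={tmp1}", source],
        ["mutool", "clean", "-g", "-g", "-d", "-s", "-l", tmp1, tmp2],
        GS_BASE + [f"-sOutputFile={result}", tmp2],
    ]


for command in build_pipeline("input.pdf", "output.pdf"):
    print(" ".join(command))
```

To actually run the steps, each list could be passed to subprocess.run(command, check=True), provided gs and mutool are on the PATH.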

Note: I think adding a third -g to Mutool is the equivalent of Ghostscript's -dDetectDuplicateImages. Since it slows rendering down, it may be better to do the opposite, but I'm not sure how to set it to false. -dDetectDuplicateImages false? -uDetectDuplicateImages?

Note: I'm using gtime to time pdf rendering.

A single-step tool in a single application would help. And an image-reduction tool would also help. Ghostscript's documentation is hard to follow.

  1. For cleanup, as an alternative to running mutool:

-dFastWebView might help.

-dNOGC indicates that Ghostscript does garbage collection by default.

  2. For image reduction:

-dPDFSETTINGS=/screen seems to work better in 9.50 than 9.23. /ebook might be better since it embeds all fonts.

-dFILTERIMAGE -dFILTERVECTOR also work better in 9.50 than 9.23, but are more drastic than I'd like.

A lot of settings seem to rely on input resolution and/or input page size.

-r seems to rely on input page size, rather than output page size. The Kindle Dx is 800 pixels by 1080 pixels.

-dDownScaleFactor reduces relative to input resolution.

-g800x1080 seems to crop pages, not shrink them.

I think -sDEVICE=pdfimage8 rasterizes everything, like k2pdfopt.

In some cases

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dFastWebView -uDetectDuplicateImages -dPDFSETTINGS=/ebook -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH yields larger and slower files than just -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH

... I'm not sure what to make of these results.

Marja E

2 Answers


You've asked an awful lot in here, which makes it rather difficult to read and answer cogently. You haven't really made it clear exactly what it is you want to achieve (you also haven't said what version of GS and MuPDF you are using).

Here are some points:

You don't need to 'clean out the junk' from PDF files produced by Ghostscript. These rarely have anything which can be removed; that's one reason people run PDF files through GS+pdfwrite (despite my saying constantly it's a bad idea).

Using the -g switch with Mutool twice doesn't (AFAIK) do anything extra, but adding -d decompresses the files. You can have Ghostscript produce uncompressed PDF files too, use -dCompressPages=false -dCompressFonts=false -dCompressStreams=false.

When you pass your PDF through pdfwrite, then MuPDF, then pdfwrite again, you are risking quality degradation at every step, and the intermediate MuPDF step is unlikely to achieve anything. Most likely what you are doing is reducing the compression (and quality) of any JPEG compressed images; I doubt much else of use is happening.

I can't think why you'd want to not detect duplicate images; it really just makes the file bigger. But if you want to, you use the switch the same way as all the other GS switches: -dDetectDuplicateImages=false. Note this won't change the processing speed (and generally pdfwrite doesn't do rendering, but perhaps you mean on the target device...). The detection is done by applying an MD5 filter to every image as it is read, then comparing the MD5 hashes. Switching that off doesn't stop the MD5, it just stops the comparison.
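
As a rough illustration of the hash-and-compare idea (this is not Ghostscript's actual code; the function name and the sample byte strings are made up), duplicate detection can be sketched like this:

```python
import hashlib


def find_duplicate_images(images):
    """For each image, return the index of the first image with identical bytes."""
    first_seen = {}  # MD5 digest -> index of the first image with that digest
    canonical = []   # canonical[i] is the image that image i can be replaced by
    for index, data in enumerate(images):
        digest = hashlib.md5(data).hexdigest()
        # setdefault stores the index only the first time a digest appears.
        canonical.append(first_seen.setdefault(digest, index))
    return canonical


images = [b"logo-bytes", b"photo-bytes", b"logo-bytes"]
print(find_duplicate_images(images))  # [0, 1, 0]
```

The third image hashes to the same digest as the first, so only one copy would need to be stored. Skipping the comparison saves the lookup but not the hashing.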

If you find Ghostscript's documentation hard to follow, then use the Adobe documentation for distillerparams, that's where the majority of the pdfwrite settings come from (ie blame Adobe for this ;-)

-dFastWebView is (IMO) totally pointless. It's there purely for compatibility with Adobe, and because a lot of people won't accept that it's useless and insist on it. All it does is speed up loading of the first page of a PDF file, by PDF consumers which support it (which is practically none). And to do this it makes the file slightly bigger and more complicated.

Do NOT use -dNOGC. I keep telling people not to do this. It's a debugging tool; it has no practical value in production other than to potentially make Ghostscript use more memory. Everything else you hear about it is cargo cult.

-r has nothing to do with the media size at all, and does (more or less) nothing with pdfwrite. It sets the resolution of a page when rendering. Since you don't want to render to an image, setting the resolution is not a useful thing to do.

No pdfwrite settings rely on the "input resolution" because PDF (and PostScript) files don't have a resolution, they are vector page descriptions.

-dDownScaleFactor is a switch which only applies to the downscaling devices (tiffscaled and friends), which are rendering devices. It has no effect at all on pdfwrite.

Setting a fixed media size (using -g) does indeed rely on the resolution (because it's specified in device pixels) and does indeed only alter the media size, not the content. If you want to rescale the content to fit the new media, then you need to use -dFitPage. I can't really see why you would do that. Note that it doesn't affect the content of a PDF file (unless it's a rendered image), it just makes all the numeric values smaller.
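
The scaling involved can be estimated by hand. A rough sketch, assuming the scale is uniform and simply the smaller of the width and height ratios (it ignores rotation and centring, and the function name is made up for illustration):

```python
def fit_page_scale(pixels_w, pixels_h, resolution_dpi, page_w_pt, page_h_pt):
    """Return (media_w_pt, media_h_pt, scale) for a -g size at a given -r.

    -g is in device pixels, so the media size in points (1/72 inch)
    depends on the resolution. page_w_pt and page_h_pt describe the page
    requested by the input file.
    """
    media_w_pt = pixels_w * 72.0 / resolution_dpi
    media_h_pt = pixels_h * 72.0 / resolution_dpi
    # Assumption: one uniform scale, limited by the tighter dimension.
    scale = min(media_w_pt / page_w_pt, media_h_pt / page_h_pt)
    return media_w_pt, media_h_pt, scale


# -g800x1080 -r150 with a US Letter input page (612 x 792 points)
w, h, s = fit_page_scale(800, 1080, 150, 612, 792)
print(round(w, 1), round(h, 1), round(s, 3))  # 384.0 518.4 0.627
```

So at 150 dpi the 800x1080 pixel media is about 5.3 by 7.2 inches, and a Letter page would be scaled to roughly 63% to fit it. Without the fit-page switch only the media shrinks, so content runs off the page.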

The pdfimage devices do indeed produce a PDF file where the entire content is an image; hence the name....

Now, if you could define what you actually want to achieve, I could make some suggestions.....

[EDIT] image downsampling

Firstly there are three controls which turn this feature on/off altogether;

-dDownsampleMonoImages, -dDownsampleGrayImages and -dDownsampleColorImages. Assuming you don't select a PDFSETTINGS (I would recommend you do not) these are all initially false. If you want to downsample any images you need to set the relevant mono/gray/color switch to true.

Once downsampling is enabled then you need to set the relevant ImageResolution and ImageDownsampleThreshold; there are again switches for each colour depth.

Now although PDF files don't have a resolution, the images have an effective resolution, but it's not easy to calculate (actually, without a lot of effort it's impossible). It's the number of image samples in the bitmap in each direction, divided by the length of the media covered by the image in that direction.

As an example if I have an image 100x100 samples, and that is placed on the page in a 1 inch square, then the resolution of the image is 100 dpi. If I then scale the image up so that it covers 2 inches square (but don't change the image data) then its 50 dpi.
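
The same arithmetic as a small sketch (the function name is hypothetical; it handles one direction at a time):

```python
def effective_dpi(samples, covered_inches):
    """Effective resolution of a placed image along one direction."""
    if covered_inches <= 0:
        raise ValueError("covered length must be positive")
    return samples / covered_inches


print(effective_dpi(100, 1))  # 100.0
print(effective_dpi(100, 2))  # 50.0
```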

So you need to decide what resolution looks OK on your device. You then set -dColorImageResolution=, -dMonoImageResolution, -dGrayImageResolution.

That's the 'target' resolution. But if the image is already close to that it can be wasteful to process it, so the Downsampling threshold is consulted. The actual resolution of the image in the input has to be the target resolution times the threshold, or more, to be reduced for output.

If we consider, for example, a target resolution of 300 and a threshold of 1.5 then the actual resolution of an image in the input file would have to exceed 450 dpi to be considered for downsampling.

Obviously you can set the threshold to 1.0 eg -dColorImageDownsampleThreshold=1.0
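
A sketch of the threshold test and of the switches for one image type. The function names are made up for illustration, and treating an image exactly at target times threshold as eligible is an assumption:

```python
def will_be_downsampled(actual_dpi, target_dpi, threshold):
    """True when the effective resolution reaches target * threshold."""
    return actual_dpi >= target_dpi * threshold


def downsample_switches(kind, target_dpi, threshold=1.0):
    """Build the three switches for one image kind: 'Mono', 'Gray' or 'Color'."""
    if kind not in ("Mono", "Gray", "Color"):
        raise ValueError(f"unknown image kind: {kind}")
    return [
        f"-dDownsample{kind}Images=true",
        f"-d{kind}ImageResolution={target_dpi}",
        f"-d{kind}ImageDownsampleThreshold={threshold}",
    ]


print(will_be_downsampled(400, 300, 1.5))  # False
print(will_be_downsampled(600, 300, 1.5))  # True
print(" ".join(downsample_switches("Gray", 150)))
```

The last line prints -dDownsampleGrayImages=true -dGrayImageResolution=150 -dGrayImageDownsampleThreshold=1.0, which can be appended to a pdfwrite command line.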

Finally there is the downsampling type; this is the filter used to create the lower resolution image from the higher. The simplest is /Subsample: basically throw away enough lines and columns until we reach the required resolution (this is the only filter available for monochrome images, as all the others would change the colour depth). Then there's /Average, which averages the value in each direction, effectively a bilinear filter. Finally there's /Bicubic, which probably does the 'best' job but will be the slowest to process.
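
To see why the filters behave differently, here is a toy comparison of subsampling and averaging on a tiny grayscale grid. It is a simplified illustration, not Ghostscript's implementation; the function names and the integer reduction factor are assumptions.

```python
def subsample(pixels, factor):
    """Keep every factor-th row and column and discard the rest."""
    return [row[::factor] for row in pixels[::factor]]


def average(pixels, factor):
    """Replace each factor x factor block with the integer mean of its samples."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for top in range(0, height, factor):
        out_row = []
        for left in range(0, width, factor):
            block = [
                pixels[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            out_row.append(sum(block) // len(block))
        out.append(out_row)
    return out


# A 4x4 black-and-white checkerboard (0 = black, 255 = white).
image = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 255, 0, 255],
    [255, 0, 255, 0],
]
print(subsample(image, 2))  # [[0, 0], [0, 0]]
print(average(image, 2))    # [[127, 127], [127, 127]]
```

Subsampling keeps only values that were already in the image, while averaging produces a mid-gray that a 1-bit image cannot store.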

On top of all that you can choose the Image Filter (the compression filter) used to write the image data. We don't support JPXEncode in the AGPL version of Ghostscript and pdfwrite. That leaves you /CCITTFaxEncode (for monochrome), DCTEncode (JPEG) and FlateEncode (basically Zip compression). That's MonoImageFilter, GrayImageFilter and ColorImageFilter.

If you want to use these you must first set AutoFilterGrayImages to false and/or AutoFilterColorImages to false, because if these are true the pdfwrite device will choose a compression method by looking to see which one compresses most. For Gray and Color images this will almost certainly be JPEG.
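
A sketch of the switch pair needed to pin the compression filter for gray or colour images. The function name is hypothetical, and the sketch deliberately handles only the two AutoFilter cases and the two filters named here:

```python
def fixed_filter_switches(kind, image_filter):
    """Switches that disable auto-selection and pin one compression filter."""
    if kind not in ("Gray", "Color"):
        raise ValueError("this sketch only handles 'Gray' and 'Color'")
    if image_filter not in ("DCTEncode", "FlateEncode"):
        raise ValueError(f"unsupported filter: {image_filter}")
    return [
        f"-dAutoFilter{kind}Images=false",
        f"-d{kind}ImageFilter=/{image_filter}",
    ]


print(" ".join(fixed_filter_switches("Gray", "FlateEncode")))
```

This prints -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode. Flate is lossless, so it avoids repeated JPEG degradation at the cost of larger files.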

Final point is that linework (vector data) cannot be selectively rendered; either everything is rendered or everything is maintained 'as it was'. The only time (in general) that pdfwrite renders content is when transparency is present and the output CompatibilityLevel doesn't support transparency (1.3 or below). There are exceptions but they are quite uncommon.

You might want to consider setting the ColorConversionStrategy to either /DeviceRGB or /DeviceGray. I've no idea if you are using colour or grayscale devices, but if they are grayscale creating a gray PDF file would reduce the size and processing significantly. Creating an RGB file for colour devices probably makes sense too, in case the input is CMYK.

KenS
  • I want to fix pdf files so that they don't slow down, freeze, or crash my Kindle, so I can read them. Willus's k2pdfopt (with -mode copy -dev dx) can do that, but it rasterizes everything. So it's good for scanned pdfs but perhaps not the best choice for pdf-born-pdfs. Ghostscript can sometimes do that, but not always. Sometimes the resulting files are too big, or are too slow, or crash my Kindle. – Marja E Oct 29 '19 at 08:58
  • "When you pass your PDF through pdfwrite, then MuPDF, then pdfwrite again, you are risking quality degradation at every step," So? I'm usually keeping the original, I'm just trying to create Kindle-readable copies for my Kindle. "and the intermediate MUPDF step is unlikely to achieve anything." It reduces file size and loading time. I'm not quite sure why. Among other things, I'm trying to reduce file size and loading time. Especially loading time. – Marja E Oct 29 '19 at 09:08
  • I doubt it's the step of passing through MuPDF that's doing that (especially when it comes to loading times). It's most likely something else, probably double-converting images to JPEG resulting in less image data. Of course, without an example it's impossible to tell. I think it's important to tell other potential readers that multiple conversions of a PDF file will lose quality; people often read these threads with other goals and don't understand the details, so I like to spell out the consequences. – KenS Oct 29 '19 at 09:38
  • I'm afraid that 'fix pdf files so that they don't slow down....' isn't a technical enough description of your goals. I have no idea why your Kindle has problems with specific PDF files, since I don't have a Kindle and don't know anything about its software. If you can characterise what you want in your PDF files I can tell you how to achieve that; I can't tell you why your Kindle crashes. – KenS Oct 29 '19 at 09:42
  • Ah, thank you. 1. PDF 1.4 compatibility. Older Kindles can't handle newer elements, such as passwords, jpeg 2000/jpx images, etc. 2. Reduced disk requirements and processing requirements on the device. 3. Keep clear text and tables. Reflow breaks tables. Rasterization is sometimes necessary, but not ideal. 4. I've had the most trouble with images, whether vector or raster. I'm happy to rasterize the vector images, downsample raster images, and sometimes delete both. – Marja E Oct 29 '19 at 19:29
  • PDF doesn't do reflow, so that's not an issue. -dCompatibilityLevel=1.4, if you don't set -sPassword or -sUserPassword then the file won't get password protected. Ghostscript's pdfwrite device never emits JPEG 2000. 'Reduced disk requirements and processing requirements' can't really comment, since I don't know what that entails on the Kindle. The pdfwrite device generally only renders when absolutely required (eg transparency on a file < PDF 1.3). Be careful with the term 'images', in PDF that means a bitmap, vectors are just that, vectors, there is no concept of a 'vector image'. – KenS Oct 29 '19 at 19:35
  • You cannot selectively delete images with pdfwrite, the -dFILTERIMAGE switch will elide all of them, again, bitmaps only because that's what image means in PDF. Downsampling requires you to set several switches (this is an Adobe thing). I've edited my answer for those, too tedious in a comment. – KenS Oct 29 '19 at 19:37
  • Thank you. I'm still going over this. The device is 800x1080 pixels, grayscale. With scanned pdfs, k2pdfopt is a good option. k2pdfopt -mode copy rasterizes to a fairly high resolution, and exaggerates color differences. That works well for maps and for some other scans. k2pdfopt -mode copy -dev dx rasterizes to 800x1080, and exaggerated grayscale. That works well for most scanned books. – Marja E Nov 02 '19 at 03:44
  • With pdf-born-pdfs, ghostscript can be a good option. gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf makes minimal changes. That works well for some works, especially if I'm reading them on my computer more than my Kindle. But ... given freezes and crashes, I'd like some counterpart to -dev dx for other works. – Marja E Nov 02 '19 at 03:45
  • If a raster image takes up the whole screen, I want it to reduce it to no more than 800 pixels across and no more than 1080 pixels high. Given American paper sizes, the 800 is likely to be the limiting factor. If it takes up a part of the screen, I want an appropriate proportion of that 800 by 1080. If -dFitPage specifies size in 1/72" increments, then I think I could fit to 800x1080 (A bit more than 11" by 15") and then set resolution to 72 dpi. It's hard to check when everything's scattered among multiple web pages instead of a single searchable manual. – Marja E Nov 02 '19 at 03:45
  • Pretty much all the vector device specific settings are in one single HTML file, which is entirely searchable. You are mixing pixels and media size, the PDF file doesn't have a resolution, it has a media size, which may well not be a standard size at all, let alone a US one, and may be landscape even if it is. There's no setting (and no way in the code) to detect if an image covers the entire media, so you can't use that as a criterion. – KenS Nov 02 '19 at 08:58
  • -dFitPage doesn't have any 'increments' at all. What you do is start with a media size and make it fixed, either by using -g (size in pixels) or specifying -dFIXEDMEDIA. The PDF file will then request the media size it wants to use and the FitPage code will determine the scaling required to fit the requested media onto the existing media. It then applies that scaling to the content. Note that using -g isn't hugely useful with pdfwrite as the output, because PDF files aren't raster data. – KenS Nov 02 '19 at 09:01
  • When producing a PDF file changing the resolution doesn't do much because the output isn't (usually) a bitmap. So changing the resolution of the pdfwrite device to 72 will result in different co-ordinates in the output PDF, but won't change the content appreciably. – KenS Nov 02 '19 at 09:03
  • But the pdfs usually contain fancy bitmap images which cause so much trouble, don't even show without the right pre-processing, and slow everything down. – Marja E Nov 03 '19 at 18:37
  • Then you need to use the image downsampling, as described in my edit above. Setting the resolution of the pdfwrite device, however, has no impact on bitmap data contained in the PDF file. Unless you downsample the images, they are retained unchanged in the output PDF file. – KenS Nov 03 '19 at 19:34
  • Wow best write up of GS image compression I've seen. Cheers! – Subtletree May 30 '21 at 22:38

Check my recent answer. It uses MuPDF, which is provided by the same Artifex that distributes Ghostscript, and produces very clear monochrome PDF files. It would be interesting to know how these files would work with your Kindle (if you still have the old one).

Compressing text heavy PDFs without ghostscript and only ImageMagik causes blurry text

Supernuija