I am working on a project to reduce the size of PDFs by compressing them. Are there any good .NET tools/libraries on the market that do this well? I tried a few tools, such as Onstream Compression, but the results were not satisfactory.
- It's a common misconception that "PDF is a file, files can be compressed, therefore PDFs can be compressed." That is not (always) true. 1. The most expensive data in a PDF -- text, images, fonts -- *is* already compressed by default, with the very efficient zlib Flate algorithm. 2. Images can be compressed "more", but only by converting them to a cheaper color model, downsampling, and/or lowering their JPEG quality. 3. You cannot 'downsample' pure text and vector image data. – Jongware Jan 24 '14 at 23:37
- (Afterthought) Items that *can* be (re-)compressed are: 1. the object stream itself (PDF 1.5; see "Object Streams" in the reference); 2. items that were not compressed, or that used the older LZW or (ancient) RLE compression; 3. overcomplete embedded fonts, which might be replaced with a subsetted version; 4. bitmap images, which *may* be compressed more efficiently by tinkering with the `/Predictor` value; 5. a thorough vector data check, which may be able to discard invisible or double-rendered objects. – Jongware Jan 24 '14 at 23:43
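To illustrate item 2 of the afterthought: once a stream's raw bytes have been decoded, re-encoding them with Flate is a one-liner in most languages. A minimal Python sketch using the standard `zlib` module (in a real PDF the stream's `/Filter` entry would have to be rewritten to `/FlateDecode` at the same time; this is only the byte-level idea, not a PDF library call):

```python
import zlib

def recompress_flate(decoded: bytes) -> bytes:
    """Re-encode already-decoded stream data with Flate at maximum effort.
    Useful for streams that were stored uncompressed or with LZW/RLE."""
    return zlib.compress(decoded, 9)

# A stream stored uncompressed shrinks considerably once Flate-encoded:
data = b"0 0 m 100 100 l S\n" * 500  # fake, highly repetitive content stream
packed = recompress_flate(data)
assert len(packed) < len(data)
```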
4 Answers
Some additional (mega)bytes can easily be squeezed out of PDFs. For example, is the well-known "PDF32000_2008.pdf" optimized enough? Its file size is 8,995,189 bytes. It uses object and xref streams, has (nearly) no images, and everything is packed tight. Or is it not?
Look at a page dictionary:
Dict:9 [1 0 R]
. /Annots Array:3
. /Contents Stream:3 [2 0 R]
. /CropBox Array:4
. /MediaBox Array:4
. /Parent Dict:4 [124248 0 R]
. /Resources Dict:4
. /Rotate 0 (Number)
. /StructParents 2 (Number)
. /Type Page (Name)
`Rotate 0` is a default, so why is it there? What is `CropBox` there for? It defaults to `MediaBox`, and there's no page in this document with a `CropBox` different from its `MediaBox`. Why is `MediaBox` there? It's inheritable, and all pages are the same size, so move it to the Pages tree root! There are 756 pages, i.e. redundant (or useless) information replicated 756 times.
Look at typical Annotation dictionary:
Dict:6 [3548 0 R]
. /A Dict:2
. . /S URI (Name)
. . /URI http://www.iso.org/iso/iso_catalogue/... (String)
. /Border Array:3
. . [0] 0 (Number)
. . [1] 0 (Number)
. . [2] 0 (Number)
. /Rect Array:4
. . [0] 82.14 (Number)
. . [1] 576.8 (Number)
. . [2] 137.1 (Number)
. . [3] 587.18 (Number)
. /StructParent 3 (Number)
. /Subtype Link (Name)
. /Type Annot (Name)
There are thousands (maybe more than 10,000?) of link annotations in this document. The `/Type` key is optional, so why is it there? These are invisible rectangles; do you think placement precision beyond a whole number of points is relevant? Round the coordinates to integers.
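Rounding those rectangles is trivial; a sketch (the `/Rect` array modeled as a plain Python list, values taken from the annotation shown above):

```python
def round_rect(rect):
    """Snap an annotation /Rect to whole points: the rectangle is
    invisible, so sub-point precision only wastes bytes in the file."""
    return [round(v) for v in rect]

print(round_rect([82.14, 576.8, 137.1, 587.18]))  # [82, 577, 137, 587]
```

Shorter numerals, repeated thousands of times, compress into real savings.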
Look at the fragment of typical page content stream, text showing operator:
[(w)7(ed)-6( b)21(u)1(t shal)-6(l no)-6(t b)-6(e)1( ed)-6(ite)-6(d)1( un)-6(less the typef)23(aces wh)-6(ich )]TJ
Kerning below some value is all but invisible. The exact value may be debated; it's like JPEG compression quality, acceptable to some while others disagree. I think a very conservative estimate (i.e. retaining most quality, with an effect invisible to the average reader) is that kerning with an absolute value of less than 10 may be omitted. (Care must be taken to preserve justification, of course.) And I don't even mention that there are files out there with fractional kerning to a precision of 3-6 decimal places! (But not in this file.)
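The idea can be sketched as follows. This is a toy model: the `TJ` array is a Python list of strings and numbers rather than parsed content-stream tokens, and the threshold of 10 is the conservative estimate from above:

```python
def strip_kerning(tj, threshold=10):
    """Drop kerning adjustments smaller than the visibility threshold,
    merging the string fragments they used to separate. Larger
    adjustments (which may carry justification) are kept as-is."""
    out = []
    for item in tj:
        if isinstance(item, (int, float)) and abs(item) < threshold:
            continue  # invisible adjustment: drop it
        if isinstance(item, str) and out and isinstance(out[-1], str):
            out[-1] += item  # merge fragments around a dropped kern
        else:
            out.append(item)
    return out

tj = ["w", 7, "ed", -6, " b", 21, "u", 1, "t shal", -6, "l no"]
print(strip_kerning(tj))  # ['wed b', 21, 'ut shall no']
```

Note how the adjustment of 21 survives while the single-digit ones vanish; a real implementation would also have to respect word spacing and justification, as cautioned above.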
And with the optimizations mentioned above, the file size became 7,982,478 bytes. One megabyte shaved off. And that's certainly not the limit; there may be other, better-hidden sources of optimization.

- I don't disagree with you, but a *big* caution belongs with your answer. The caution is that these optimisations a) require intimate knowledge of the PDF specification to perform, b) will work much better for some documents than for others (for most graphic arts documents your optimisations will yield close to zero percent), and c) (most importantly) rely on perfect PDF readers that implement the specification with 100% accuracy. If this style of optimisation is performed, it will require very thorough testing. – David van Driessche Jan 26 '14 at 12:37
- @David: and how exactly, sir, will omitting default or optional keys, or storing "common Page attributes in the Pages object" (advice found even in Reference 1.0, as is "Omit default values"), compromise adherence to the specification? Is there a document reader smart enough to understand a compressed xref table (advice not disputed in the other answers) but breaking on a missing MediaBox in each page's dictionary? And, sure, any advice "will work much better for some documents than others". – user2846289 Jan 26 '14 at 17:48
- @Vladimir, you don't have to believe me, but I have seen enough implementations of PDF libraries (including having worked on products using two different PDF libraries) to have become very conservative when it comes to touching PDF files in non-trivial ways. As for some documents doing better than others: in this case the difference is quite dramatic, which was certainly worth mentioning. – David van Driessche Jan 27 '14 at 09:43
To add a few more notes to the already good answers: there is a whole range of applications/libraries that will reduce the file size of PDF files. The first question, in line with @Jongware's answer, is whether anything can be done to begin with.
If your PDF files come from everywhere (you have no control over the source), gather a sample of files and determine your requirements for the resulting PDFs. If you only want to show them on screen, for example, you have the option to resample images to a much lower resolution (be careful: for mobile use that is no longer necessarily enough). If the PDFs are all internal, you have it easier, because you can inspect them and see where you could save.
Use Adobe Acrobat's "Space Audit" feature. Adobe seems to find satisfaction in hiding this nice tool and moving it around between versions of Acrobat, but in Acrobat Pro XI it can be found by opening a PDF file and selecting "File > Save as other > Optimized PDF..." (not "Reduced size PDF", as you would think). In the dialog window that appears there is an "Audit space usage" button that brings up an information window showing how much space each kind of element in the PDF is using.
Depending on what you find there, there are multiple things you can do. Most have already been mentioned, but here's an incomplete list:
- Downsample images.
- Change color spaces of images from CMYK to RGB. Be cautious about this as it will a) not provide the space savings you might think (because of compression) and b) might actually be counter-productive if you're unlucky (because of indexing and other neat image tricks).
- Remove document and object level metadata (some sample sets of magazine page files I have contain more metadata than actual content).
- Remove proprietary application data (Illustrator has a nasty habit of embedding the complete Illustrator document into a PDF file if you're not careful).
- Compress object streams and XRef tables if you're sure all readers you're using will be able to handle that.
- Use optimal compression (JBIG2, JPEG2000, ...) IF your target readers will handle it.
- Optimize the file structure (some bad PDF files don't optimise fonts and other objects and will have multiple copies scattered throughout the file).
- Subset all fonts in the document.
- Remove ICC profiles if they're not needed.
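The first item of the list, downsampling, can be illustrated with a crude nearest-neighbour sketch. Real optimizers use averaging or bicubic resampling and then re-encode the result (as JPEG, say), but the size arithmetic is the same; pixels here are just a row-major list of lists, not a real image object:

```python
def downsample(pixels, factor):
    """Nearest-neighbour downsample of a row-major pixel grid:
    keep every factor-th sample in both dimensions."""
    return [row[::factor] for row in pixels[::factor]]

image = [[x + 10 * y for x in range(300)] for y in range(300)]  # fake 300x300 image
half = downsample(image, 2)
assert len(half) == 150 and len(half[0]) == 150  # 4x fewer samples to compress
```

Halving the resolution quarters the raw sample count, which is why downsampling dominates the savings for image-heavy files.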
If you want to perform these tasks, there are many tools that can help: either libraries that let you implement this yourself, or commercial (and probably other) tools that work through the command line with predefined actions. callas pdfToolbox is one of these tools (I'm connected to this company!), Enfocus PitStop has functionality in this area, and Apago also has functionality here (though I'm not sure off the top of my head whether they have a command-line version).

@Jongware is right. It's not likely that you will be able to significantly reduce the size of a properly created PDF file.
But many PDFs in the wild can be compressed better. That's because many PDFs do not use the object and cross-reference streams introduced in newer versions of the PDF specification. Also, PDFs often contain unused objects that can be safely removed. And yes, images in PDFs can be resized / recompressed to further reduce the size of a PDF.
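Finding those unused objects is essentially a reachability problem. A minimal mark-and-sweep sketch over a toy object graph (each object is just an id mapped to the ids it references; a real implementation would walk actual indirect references starting from the trailer):

```python
def live_objects(objects, roots):
    """Mark-and-sweep over a toy object graph. Anything unreachable
    from the trailer roots is dead weight and can be dropped when
    the file is rewritten."""
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        stack.extend(objects.get(obj, ()))
    return live

objects = {1: [2, 3], 2: [], 3: [2], 4: [5], 5: []}  # 4 and 5 are orphans
print(live_objects(objects, roots=[1]))  # {1, 2, 3}
```

Orphaned objects like these commonly accumulate after incremental saves, which is why rewriting such files shrinks them.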
If you are fine with commercial solutions, then you might be interested in my answer to a similar question. That answer contains code that shows how to compress PDFs with the Docotic.Pdf library (I am one of the developers of the library).
There is the PDFBeads Ruby gem.
It works with RubyInstaller 2.3.3 32-bit with DevKit. (Higher versions require the unnecessarily large MSYS2 DevKit.)
For Windows these programs are needed:
- ImageMagick 6.9.x 32-bit dll version with C/C++ development headers (http://ftp.icm.edu.pl/pub/graphics/ImageMagick/binaries or https://yadi.sk/d/4DGwC9Ie3Lkkgo)
- jbig2 (http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe or https://yadi.sk/d/4DGwC9Ie3Lkkgo)
- libiconv (http://gnuwin32.sourceforge.net/packages/libiconv.htm)
The iconv gem needs to be installed separately with
gem install iconv -- --with-iconv-include="<path>" --with-iconv-lib="<path>"
(this works with simple, short paths)
