
I have been searching a lot on Google for how to compress an existing PDF (its size). My problem is:

  1. I can't use any standalone application, because it needs to be done by a C# program.

  2. I can't use any paid library, as my clients don't want to go over budget. So a PAID library is certainly a NO.

I did my homework for the last 2 days and came upon solutions using iTextSharp and BitMiracle, but to no avail: the former decreases a file by just 1%, and the latter is paid.

I also came across PDFcompressNET and pdftk, but I wasn't able to find their .dlls.

Actually, the PDF is an insurance policy with 2-3 black-and-white images and around 70 pages, amounting to a size of 5 MB.

I need the output in PDF only (it can't be in any other format).

Prahalad Gaggar
  • What's BitMiracle's compression result? – Martheen Dec 05 '12 at 09:43
  • I can't use BitMiracle as it is a paid library! – Prahalad Gaggar Dec 05 '12 at 09:48
  • Are you sure compression will help at all? Try to create some test cases of PDF files and compress them with various off-the-shelf programs/methods. What is the compression rate on these? Maybe you're trying to do something which isn't worth it/possible? – André C. Andersen Dec 05 '12 at 09:49
  • I am going to be very specific about the generation of the PDF now: 1. It's a PDF containing an insurance policy generated by an application. 2. The application generates about 50 .doc files first, then converts these 50 .doc files to .pdf files, and lastly merges all these 50 .pdf files into 1 .pdf file. This procedure CANNOT be changed by any means. – Prahalad Gaggar Dec 05 '12 at 10:01
  • @AndréChristofferAndersen Exactly. I'm curious if it could be compressed at all. PDF is actually already compressed, and Prahalad, even if the application created inefficient PDF compression, I don't see any better way to 'compress' it other than reducing the image quality – Martheen Dec 05 '12 at 11:27
  • If the file you referred to is representative, the step "merging of the 50 pdf files" unfortunately used the iTextSharp 4.1.2 library in the wrong way (using PdfWriter instead of PdfCopy for this task)... Well, at first glance, your main problem may be the 70 included font subset files; many of them in spite of compression require more than 80 KB each! Unfortunately recombining multiple different subsets of the same font generally is hard (the content of most pages of your document may have to be rewritten) and is not as such explicitly supported by iText(Sharp); this would be quite a feat! – mkl Dec 05 '12 at 11:31
  • And why oh why can't it merge the doc files and *then* convert the merged doc into PDF? – Martheen Dec 05 '12 at 11:33
  • @Martheen I didn't use any program to reduce the size. About the process of merging: since it is generating an insurance policy, the .doc files are converted to .pdf periodically during the process... and we can't first merge the doc files and then convert the result to PDF! I don't mean to offend you in any way... but I hope you can help me in solving the problem! – Prahalad Gaggar Dec 05 '12 at 11:49
  • Ah, I see. Can the resulting PDF be reconverted to DOC, then merged as DOC, and then converted to PDF? – Martheen Dec 05 '12 at 11:51
  • @mkl I can't reduce the font size at all; also, the headers should be used wherever possible, since it is an insurance policy! – Prahalad Gaggar Dec 05 '12 at 11:52
  • @Martheen are you sure that if we first merge all the doc files into a single doc file and then convert that doc to PDF, it will ultimately reduce the size of the resultant PDF? If you are sure, then I am going to change my code, but I need to be sure about it! – Prahalad Gaggar Dec 05 '12 at 11:57
  • @Prahalad I'm afraid that in a process like yours, an ever-growing PDF file where the separate steps don't interact for optimization, you very likely end up with big files. I do wonder, though: is it necessary to embed fonts in the first place? If the presence of the same fonts on all machines of the process could be guaranteed, the doc-to-PDF export could be attempted without font embedding. It might even be possible to afterwards embed the font once and point all font references to that one embedding. This would have to be tested first, though, and depends on the cooperation of all components. – mkl Dec 05 '12 at 12:31
  • @Prahalad As you say that is an option, in your place I would first try merging the docs and exporting the merged doc to PDF. A proof of concept (a function taking the docs of one such insurance case, merging them, then exporting to PDF) should take no more than an hour, and it is quite likely (obviously we cannot guarantee anything, not knowing the doc files and software versions in question) that this will make things better. – mkl Dec 05 '12 at 12:43
  • @mkl the application is stored in the cloud, so we don't need to worry about fonts... time is not a criterion for me! Keep giving inputs... I am very glad to receive your help... also, if it's possible, can you change this question to a discussion? – Prahalad Gaggar Dec 05 '12 at 13:06
  • @Prahalad For further input I'm waiting for the result of the proof of concept first merging the docs and then exporting to PDF. Obviously we're done if the resulting PDF is considerably smaller. If it isn't, please supply it for further analysis. – mkl Dec 06 '12 at 13:42
  • @mkl I did my test, and my results tell me that it really doesn't matter how and when you combine your files! Please check the link; I think it could solve our problem! [link](http://www.neeviapdf.com/support/examples/pdfcompress/) – Prahalad Gaggar Dec 06 '12 at 14:11
  • @Prahalad Are you going to release the source code of your solution under the AGPL? If not, then you will need to purchase a commercial license for iTextSharp. – Bobrovsky Dec 06 '12 at 15:58
  • @Martheen Docotic.Pdf can reduce the size of already compressed files. For example, the PDF Reference can be reduced from about 30 MB down to about 17 MB. And such a result can be achieved without recompression of images or any other destructive changes. Of course, all files are different, and the size reduction may not be that great in many cases. – Bobrovsky Dec 06 '12 at 16:02
  • @Bobrovsky do you have any idea how we can reduce an existing PDF? Thank you for your presence in my question. – Prahalad Gaggar Dec 07 '12 at 04:58
  • @Vijay without any further explanation I doubt your bounty is well-spent. A new question with your very requirements and attempts (I hope you have made attempts) would have been better. – mkl Oct 05 '15 at 10:09
  • It's very simple. Client: I want X. You: X costs $$$. Client: I don't want to pay $$$. You: Then, you don't get X. – Chris Pratt Oct 15 '15 at 13:37

4 Answers


Here's an approach to do this (and this should work regardless of the toolkit you use):

If you have a 24-bit RGB or 32-bit CMYK image, do the following:

  • determine if the image is really what it claims to be. If it's CMYK, convert to RGB. If it's RGB and really gray, convert to gray. If it's gray or paletted and only has 2 real colors, convert to 1-bit. If it's gray and there is relatively little in the way of gray variation, consider converting to 1-bit with a suitable binarization technique (a minimal detection sketch follows this list).
  • measure the image dimensions in relation to how it is being placed on the page - if it's 300 dpi or greater, consider resampling the image to a smaller size depending on the bit depth of the image - for example, you can probably go from 300 dpi gray or rgb to 200 dpi and not lose too much detail.
  • if you have an rgb image that is really color, consider palettizing it.
  • Examine the contents of the image to see if you can help make it more compressible. For example, if you run through a color/gray image and find a lot of colors that cluster, consider smoothing them. If it's gray or black and white and contains a number of specks, consider despeckling.
  • choose your final compression wisely. JPEG2000 can do better than JPEG. JBIG2 does much better than G4. Flate is probably the best non-destructive compression for gray. Most implementations of JPEG2000 and JBIG2 are not free.
  • if you're a rock star, you want to try to segment the image and break it into areas that are really black and white and really color.
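As a small illustration of the first bullet, here is a minimal sketch of a gray-detection check. It assumes System.Drawing (any raster API would do) and uses GetPixel for clarity; production code would use LockBits for speed:

using System;
using System.Drawing;

// Returns true when every pixel's R, G and B channels are (nearly) equal,
// i.e. the "color" image is really gray and can be stored at 8 bits.
static bool IsEffectivelyGray(Bitmap bitmap, int tolerance = 8)
{
    for (int y = 0; y < bitmap.Height; y++)
    {
        for (int x = 0; x < bitmap.Width; x++)
        {
            Color c = bitmap.GetPixel(x, y);
            int max = Math.Max(c.R, Math.Max(c.G, c.B));
            int min = Math.Min(c.R, Math.Min(c.G, c.B));
            if (max - min > tolerance)
                return false; // found a genuinely colored pixel
        }
    }
    return true;
}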

That said, if you can do all of this well in an unsupervised manner, you have a commercial product in its own right.

I will say that you can do most of this with Atalasoft dotImage (disclaimers: it's not free; I work there; I've written nearly all the PDF tools; I used to work on Acrobat).

One particular way to do that with dotImage is to pull out all the pages that are image-only, recompress them, and save them out to a new PDF; then build a new PDF by taking all the pages from the original document, replacing them with the recompressed pages, and saving again. It's not that hard.

List<int> pagesToReplace = new List<int>();
PdfImageCollection pagesToEncode = new PdfImageCollection();

using (Document doc = new Document(sourceStream, password)) {

    for (int i = 0; i < doc.Pages.Count; i++) {
        Page page = doc.Pages[i];
        if (page.SingleImageOnly) {
            pagesToReplace.Add(i);
            // a PdfImage encapsulates an image and its compression parameters
            PdfImage image = ProcessImage(sourceStream, doc, page, i);
            pagesToEncode.Add(image);
        }
    }

    PdfEncoder encoder = new PdfEncoder();
    encoder.Save(tempOutStream, pagesToEncode, null); // re-encoded pages
    tempOutStream.Seek(0, SeekOrigin.Begin);

    sourceStream.Seek(0, SeekOrigin.Begin);
    PdfDocument finalDoc = new PdfDocument(sourceStream, password);
    PdfDocument replacementPages = new PdfDocument(tempOutStream);

    for (int i = 0; i < pagesToReplace.Count; i++) {
        finalDoc.Pages[pagesToReplace[i]] = replacementPages.Pages[i];
    }

    finalDoc.Save(finalOutputStream);
}

What's missing here is ProcessImage(). ProcessImage will either rasterize the page (in which case you don't need to worry about how the image was scaled onto the PDF) or extract the image (and track the transformation matrix on the image), and then go through the steps listed above. This is non-trivial, but it's doable. A hypothetical skeleton follows.
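For orientation only, here is a hypothetical skeleton of ProcessImage. None of the helpers below are dotImage APIs; they are placeholders for the decision steps from the list above, which you would implement yourself:

// Hypothetical sketch: every helper here is a placeholder you would
// implement yourself, not an actual dotImage call.
static PdfImage ProcessImage(Stream sourceStream, Document doc, Page page, int pageIndex)
{
    // Rasterize the page or extract its image (tracking any transform).
    var raster = ExtractOrRasterize(doc, page);        // placeholder

    // Step 1: reduce bit depth where the pixels allow it.
    if (IsEffectivelyGray(raster))                     // placeholder
        raster = ConvertToGray(raster);                // placeholder
    if (IsEffectivelyBilevel(raster))                  // placeholder
        raster = Binarize(raster);                     // placeholder

    // Step 2: resample if the effective resolution is 300 dpi or more.
    if (EffectiveDpi(page, raster) >= 300)             // placeholder
        raster = Resample(raster, 200);                // placeholder

    // Step 3: wrap the raster with compression parameters chosen by bit
    // depth: G4 (or JBIG2) for 1-bit, JPEG (or JPEG2000) otherwise.
    return WrapAsPdfImage(raster);                     // placeholder
}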

plinth

I think you might want to make your clients aware that none of the libraries you mentioned is completely free:

  • iTextSharp is AGPL-licensed, so you must release source code of your solution or buy a commercial license.
  • PDFcompressNET is a commercial library.
  • pdftk is GPL-licensed, so you must release source code of your solution or buy a commercial license.
  • Docotic.Pdf is a commercial library.

Given all of the above, I assume I can drop the freeware requirement.

Docotic.Pdf can reduce the size of both compressed and uncompressed PDFs, to different degrees, without introducing any destructive changes.

Gains depend on the size and structure of a PDF: for small files or files that are mostly scanned images, the reduction might not be that great, so you should try the library with your files and see for yourself.

If you are mostly concerned about size, your files contain many images, and you are fine with losing some of their quality, then you can easily recompress existing images using Docotic.Pdf.

Here is code that makes all images bilevel and compresses them with fax compression:

static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();

        doc.Save(outputName);
    }
}

There are also RecompressWithFlate, RecompressWithGroup3Fax and RecompressWithJpeg methods.

The library will convert color images to bilevel ones if needed. You can specify the deflate compression level, JPEG quality, etc.
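For instance, here is a sketch of JPEG recompression; whether RecompressWithJpeg takes a quality argument is an assumption to check against the Docotic.Pdf documentation:

static void RecompressImagesAsJpeg(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
        {
            // skip masks; the quality argument is an assumption
            if (!image.IsMask && image.Mask == null)
                image.RecompressWithJpeg(75);
        }

        doc.Save(outputName);
    }
}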

Docotic.Pdf can also resize big images (and recompress them at the same time) in a PDF. This might be useful if the images in a document are actually bigger than needed, or if image quality is not that important.

Below is code that scales all images with a width or height greater than or equal to 256. Scaled images are then encoded using JPEG compression.

public static void RecompressToJpeg(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            // images that are used as masks, or that have an attached mask,
            // are not good candidates for recompression
            if (!image.IsMask && image.Mask == null && (image.Width >= 256 || image.Height >= 256))
                image.Scale(0.5, PdfImageCompression.Jpeg, 65);
        }

        doc.Save(outputPath);
    }
}

Images can be resized to a specified width and height using one of the ResizeTo methods. Please note that ResizeTo won't try to preserve the aspect ratio of images; you should calculate the proper width and height yourself, as in the sketch below.
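For example, here is a sketch that caps image width at 1000 pixels while preserving aspect ratio; the exact ResizeTo(width, height) overload is an assumption to verify against the Docotic.Pdf documentation:

public static void ResizeLargeImages(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        const int maxWidth = 1000;
        foreach (PdfImage image in doc.Images)
        {
            if (image.IsMask || image.Mask != null || image.Width <= maxWidth)
                continue;

            // compute the height that keeps the original aspect ratio
            int newHeight = (int)(image.Height * ((double)maxWidth / image.Width));
            image.ResizeTo(maxWidth, newHeight); // overload assumed
        }

        doc.Save(outputPath);
    }
}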

Disclaimer: I work for Bit Miracle.

Bobrovsky
  • Be SUPER careful when you scale/recompress with JPEG and change the JPEG quality. I know of a programmer who was assigned this task with legal documents for archiving, and the result was a number of court cases that had to be thrown out because the only copies of the documents were now unreadable. – plinth Dec 07 '12 at 15:39

Using PDFsharp:

public static void CompressPdf(string targetPath)
{
    using (var stream = new MemoryStream(File.ReadAllBytes(targetPath)) {Position = 0})
    using (var source = PdfReader.Open(stream, PdfDocumentOpenMode.Import))
    using (var document = new PdfDocument())
    {
        var options = document.Options;
        options.FlateEncodeMode = PdfFlateEncodeMode.BestCompression;
        options.UseFlateDecoderForJpegImages = PdfUseFlateDecoderForJpegImages.Automatic;
        options.CompressContentStreams = true;
        options.NoCompression = false;
        foreach (var page in source.Pages)
        {
            document.AddPage(page);
        }

        document.Save(targetPath);
    }
}
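Usage is a single call. Note that the method saves back to targetPath, so the original file is overwritten (the path below is just an example):

// overwrites the original file with its recompressed version
CompressPdf(@"C:\temp\policy.pdf");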
Simon
  • Thanks @Simon. It was my very first task (which I failed terribly). Now I have started working on BI applications and databases. – Prahalad Gaggar Aug 08 '19 at 05:18
  • I tried this library on .NET 5 with lib version 1.50.5147. This code snippet raises some errors. – toha May 17 '23 at 04:18

Ghostscript is AGPL-licensed software that can compress PDFs. There is also an AGPL-licensed C# wrapper for it on GitHub here.

You could use the GhostscriptProcessor class from that wrapper to pass custom commands to Ghostscript, like the ones found in this AskUbuntu answer describing PDF compression; a rough sketch follows.
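The sketch below assumes the Ghostscript.NET wrapper and an installed Ghostscript; the GhostscriptProcessor usage follows the wrapper's samples, but verify the exact API against the version you install. The pdfwrite switches themselves are standard Ghostscript options:

using Ghostscript.NET;
using Ghostscript.NET.Processor;

// Sketch: drive Ghostscript's pdfwrite device through the wrapper.
// The -dPDFSETTINGS presets (/screen, /ebook, /printer, /prepress) trade
// image quality for size; /ebook downsamples images to 150 dpi.
static void CompressWithGhostscript(string inputPath, string outputPath)
{
    string[] switches =
    {
        "-ignored",  // Ghostscript ignores the first argument (argv[0])
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook",
        "-sOutputFile=" + outputPath,
        inputPath
    };

    GhostscriptVersionInfo version = GhostscriptVersionInfo.GetLastInstalledVersion();
    using (var processor = new GhostscriptProcessor(version, true))
    {
        processor.Process(switches);
    }
}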

brismuth