Is there any way to highly compress PDF files through python?

Question

I have generated a PDF file through python using pdfkit library running on django server.

The size of pdf generated is 43.2 MB.

Total pages in pdf = 15.

Each page has 70 images, each image size = 623 bytes.

Tech Stack version used-

-Python = 3.8.16

-pdfkit = 1.0.0

-wkhtmltopdf = 0.12.6

-Django = 3.2.16

System OS = Ubuntu 22.04.2 LTS

Requirement is to compress this pdf file without compromising with the quality of content, images in it.

Any approach or suggestions for improvements in the things tried?

Things tried:

PDFNetPython3 python package- requires license/key file.
Ghost-script command from system(ubuntu) terminal to reduce file.
shrinkpdf.sh - file got from github.
ps2pdf command from system terminal.
pdfsizeopt
cpdf
Making zip of original file
Changing dpi value in pdfkit options.

File size after compression with quality of content maintained = 23.7 MB. Expectation is have more reduction in file size.

Your best approach to this problem is to reduce the size of the input images before handing them to the webkit html renderer behind pdfkit. That is the same task as making the html page lighter weight. Reduce image width/heights and use more aggressive compression. — O. Jones, Apr 08 '23 at 00:47

K J · Answer 1 · 2023-08-16T22:43:04.337

15 pages x 70 images x 623 compressed bytes images total = Over 650 KB

Thus the expectation is even with PDF placement overheads, it should be under 1 MB

The problem is that calculation can only apply under known conditions and since there is no minimal sample lets see what happens with similar so here is logo as 634 bytes (only 11 bytes bigger but readable however already very poor quality)

The aim is to not make it worse.

This will be converted by printing as PDF into one page of 165 objects (roughly 2 per image and in this case 2 dozen or so for other interactions)

So at this stage 1 page is 30,106 bytes (not bad so 15 pages should be under 1 MB)

I wont bother doing that as a good PDF writer should actually take all 15 identical pages and simply reference those as 1 to store and 14 duplicate name entries thus it would be very compact at about 35-40 KB.

so again without OP sample lets say 15 different pages must be "whatever" bytes

And the question was how to reduce the size , so the answer is there is no way to reduce file size without more degradation It is already full of severely compressed images and any more reduction can only be achieved by removing good contents.

By comparison here it is from WkhtmltoPDF at a FANTASTIC 2,513 bytes (Wow why so tiny? they are definitely separate images, so consider this file of 70 images has only 13 objects, I gave you the clue above), we can see as 600 bytes the images were pass poor. so trying to keep PDF quality up is an acrobatic challenge of keeping files as large as possible.

So if we make images better (bigger by higher density) the size increases along with the quality. still only 3.5 KB for 70 identical images but each is now 3 times bigger at 2,020 bytes.

What if we do --no-pdf-compression it will be faster and better quality but now 11,857 bytes and the JPEG image is still compressed as only 2,020 of those bytes so has not been altered it ALWAYS keeps its own compression. *** Altering PDF compression will not alter image compression*** as it is already the optimum for a JPEG/JFIF thus ONLY degrading can reduce the image storage, unless not correctly using one image for many.

7 0 obj
2020
endobj

6 0 obj <</Type /XObject/Subtype /Image/Width 50/Height 50/BitsPerComponent 8/ColorSpace /DeviceRGB
/Length 7 0 R
/Filter /DCTDecode
>>
stream
ÿØÿà JFIF  ` `  ÿÛ C

andersonhc · Answer 2 · 2023-04-12T17:31:57.430

0

I had a similar problem - one report generated with pdfkit was resulting in a 32mb file that I couldn't deliver through discord. I ended up changing my code to create the report using fpdf2 and the report now is less than 2mb

edited Apr 12 '23 at 17:31

answered Apr 12 '23 at 17:31

andersonhc

1
1

score -1 · Answer 3 · answered Apr 06 '23 at 09:46

-1

Compress images in options:

pdfkit.from_url('http://google.com', 'out.pdf', options={"image-quality": 30, "lowquality": True})

You can see other options you can pass here:

https://wkhtmltopdf.org/usage/wkhtmltopdf.txt

Good luck!

answered Apr 06 '23 at 09:46

FlipperPA

13,607
4
39
71

Vipul Bhatt · Answer 4 · 2023-08-22T07:25:50.877

-1

You can try Ghostscript following command

gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dColorImageResolution=150 -sOutputFile=outfile.pdf infile.pdf

Replace gs with your ghostscript executable.

You can change the -dColorImageResolution as required.

You can change -dPDFSETTINGS as /screen or /ebook or /printer or /prepress or /default

edited Aug 22 '23 at 07:25

answered Aug 16 '23 at 13:58

Vipul Bhatt

1
1
3

Is there any way to highly compress PDF files through python?

4 Answers4