
I have about 20 large PDFs which I have split by pages for easier access. When I split them by pages using qpdf, I observe roughly a 10x inflation in total size, which suggests that some data is duplicated into every per-page PDF. Embedded fonts are very likely the cause of the bloat. Is there a way to externalize these fonts (e.g. the user installs them on their device beforehand)? My goal is that, once I split the PDFs by page, the total size stays within 1x-2x of the original so that I can host the pages on my website.
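For reference, the split was done with something along these lines (the exact invocation is from memory, so the options may differ):

$ qpdf --split-pages Volume17_Part_III.pdf page-%d.pdf   # one output file per page; qpdf fills in %d with the page number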

Here is a sample PDF from the repository:

https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf

Any help regarding PDF splitting is welcome.

Thanks!

pkgitlab
  • PDFs can be optimized for streaming: repeated images stored only once at their first occurrence, no images outside the page bounds, and vector graphics (EPS/SVG) instead of hi-res JPEGs. Then, instead of embedded font (subsets), use a standard PDF font that is already installed with the PDF viewer. If you look at the document properties you'll see "Fast Web View: no". Splitting should be a last resort, as normally such a PDF is far smaller than 10 MB. – Joop Eggen Sep 01 '22 at 15:18
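As an illustration of the "Fast Web View" point above, qpdf can write a linearized copy optimized for page-at-a-time web viewing; whether that helps with the size inflation here is a separate question:

$ qpdf --linearize Volume17_Part_III.pdf linearized.pdf   # write a linearized (Fast Web View) copy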

1 Answer


I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:

$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.

So no luck there. About 4/5 of the size of each file is the (uncompressed) XML metadata from the main file. You may well not need this. If so, you can run:

cpdf -remove-metadata in.pdf -o small.pdf

on each output file. This reduces the size of each file by a factor of about five. Obviously, if you're splitting into groups of more than one page, the effect will not be as large.
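For example, if the split files are numbered like 641.pdf (the naming is whatever your splitter produced), a small shell loop can strip the metadata from all of them:

for f in [0-9]*.pdf; do
  cpdf -remove-metadata "$f" -o "small-$f"   # drop the XMP metadata copied from the main file
done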

johnwhitington