
I have about 20 large PDFs which I have split by pages for easier access. When I split them by pages using qpdf, I observe roughly a 10x inflation in total size, which suggests that some data is duplicated into every per-page PDF. Embedded fonts are very likely the cause of the bloat. Is there a way to externalize these fonts (e.g. the user installs them on their device beforehand)? My goal is that, once I split the PDFs by page, the total size stays within 1x-2x of the original so that I can host the pages on my website.
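For reference, the split was done with something along these lines (the exact invocation is from memory, so the options may differ):

$ qpdf --split-pages Volume17_Part_III.pdf page-%d.pdf   # one output file per page; qpdf fills in %d with the page number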

Here is a sample PDF from the repository:

https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf

Any help regarding PDF splitting is welcome.

Thanks!

pkgitlab
  • PDFs can be optimized for streaming: repeated images stored only once at their first occurrence, no images outside the page bounds, and vector graphics (EPS/SVG) instead of hi-res JPEGs. Then, instead of embedded font (subsets), use a standard PDF font that is already installed with the PDF viewer. If you look at the document properties you'll see "Fast Web View: no". Splitting should be a last resort, as normally such a PDF is far smaller than 10 MB. – Joop Eggen Sep 01 '22 at 15:18
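As an illustration of the "Fast Web View" point above, qpdf can write a linearized copy optimized for page-at-a-time web viewing; whether that helps with the size inflation here is a separate question:

$ qpdf --linearize Volume17_Part_III.pdf linearized.pdf   # write a linearized (Fast Web View) copy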

1 Answer


I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:

$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.

So no luck there. About 4/5 of the size of each file is the (uncompressed) XML metadata from the main file. You may well not need this. If so, you can run:

cpdf -remove-metadata in.pdf -o small.pdf

on each output file. This reduces the size of each file by a factor of about five. Obviously, if you're splitting into groups of more than one page, the effect will not be as large.
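For example, if the split files are numbered like 641.pdf (the naming is whatever your splitter produced), a small shell loop can strip the metadata from all of them:

for f in [0-9]*.pdf; do
  cpdf -remove-metadata "$f" -o "small-$f"   # drop the XMP metadata copied from the main file
done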

johnwhitington