I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but the compression details are, I believe, irrelevant. It does explain, however, why serving these archives in a specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, storing the content of those archives differentially is not a problem for me, and it would drastically reduce the disk footprint of the set. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs supports block sharing, or I could use snapshotting to enforce it).
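To make the delta-encoding idea concrete, here is a minimal sketch of one way to do it with nothing but the standard library: compress each new version against a base version used as a preset dictionary. This is only an illustration, not a recommendation — zlib only sees the last 32 KiB of the dictionary, so a real tool like xdelta or bsdiff would do much better on large archives.

```python
import zlib

def delta_compress(base: bytes, new: bytes) -> bytes:
    # Feed the base version in as a preset dictionary, so material
    # shared with the base is encoded as back-references, not literals.
    # Caveat: zlib's window means only the last 32 KiB of `base` helps.
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(new) + comp.flush()

def delta_decompress(base: bytes, delta: bytes) -> bytes:
    # Decompression must supply the exact same dictionary.
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()
```

When the two versions differ by ~1%, the delta is a small fraction of a plain `zlib.compress(new)` of the whole file, which is exactly the storage saving I am after.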
The question is: how do I produce the compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-shared content.
Is there a programmatic way to carry the sharing that exists at the uncompressed level over to the compressed level? Some incremental compressed format that is easily translatable to zip?
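One property of the zip format that seems relevant here: each entry is compressed independently, so sharing could perhaps be exposed at the entry level rather than the block level. As a first step, the per-entry CRC-32 and size that zip already records in the central directory make it cheap to find which entries two JARs have in common, without decompressing anything. A rough sketch of what I mean:

```python
import zipfile

def entry_signatures(path):
    # Each zip entry carries a CRC-32 and uncompressed size in the
    # central directory; reading them decompresses nothing.
    with zipfile.ZipFile(path) as z:
        return {info.filename: (info.CRC, info.file_size)
                for info in z.infolist()}

def shared_entries(path_a, path_b):
    # Entries with identical name, CRC and size are almost certainly
    # identical content, and could be stored (and maybe served) once.
    a = entry_signatures(path_a)
    b = entry_signatures(path_b)
    return [name for name, sig in a.items() if b.get(name) == sig]
```

If most entries are shared between versions, the raw deflated bytes of a shared entry could in principle be copied verbatim into any archive that contains it, which would keep the per-request CPU cost close to a plain file copy.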
Should I look at a caching solution coupled with generating JARs on the fly? That would at least alleviate the computational cost of generating the most frequently requested JARs.
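If I go the caching route, even a small LRU cache over the generated archives might absorb most of the traffic, since a handful of versions usually dominates requests. A toy sketch; `build_jar` is a hypothetical placeholder for the expensive on-the-fly assembly step:

```python
from functools import lru_cache

def build_jar(version: str) -> bytes:
    # Hypothetical placeholder for the expensive step: reassembling
    # a JAR from the differentially stored content.
    return f"jar-bytes-for-{version}".encode()

@lru_cache(maxsize=32)   # keep the 32 most recently served JARs in memory
def get_jar(version: str) -> bytes:
    return build_jar(version)
```

In practice the cache would live on disk rather than in memory, but the shape is the same: pay the assembly cost only on a miss.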
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It also doesn't scale well as the number of requests to the server grows.