Storing and serving many compressed archives with shared underlying content

Question

I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.

The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.

Those archives are actually JAR files, but the compression details are — I believe — irrelevant. But this explains, that serving those archives in a specific compressed format is non-negotiable : it is the basic purpose of the server.

In itself, it is not a problem for me to install differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs understands block sharing, or I could use snapshotting to enforce it).

The question is, how do I produce compressed zips from those files ? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-sharing content.

Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level ? An easily-translatable-to-zip incremental compressed format ?

Should I look for a caching solution coupled with generating JARs on the fly ? This would at least alleviate the computational pain from generating the JARs that are the most requested.

There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.

score 1 · Answer 1 · answered Mar 27 '13 at 15:06

If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.

If on the other hand the 1% differences are concentrated in a few % of the jar entries, with most of the jar entries unchanged, then there's hope. You can keep all of the individual jar entries in their own jar files on the server, and for each jar file you want to serve, just keep a list of those individual jar entry files to combine. It would be easy to write a fast utility to take a set of jar files and merge them into a single jar file. If there isn't one already.

score 1 · Answer 2 · answered Feb 19 '16 at 01:15

One approach I've used in the past is to log for some time the actual requests for the zip files. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.

Basically, implement your differential storage along the lines as you suggest. Allocate also some amount, say 10%, of your total storage for a LRU (or whatever other replacement algorithm you feel like) for the actual .zipped files. Every time a user requests the zip, you serve it from the cache if it is ready, or generate it on the fly and put it in the cache if not.

In the general case this may not work well, but in the common case that actual requests are typically to a small concentrated number of files, it may solve the problem.

Otherwise, I see your options as:

Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.

Storing and serving many compressed archives with shared underlying content

2 Answers2