
For various reasons, I am using LZMA2 to compress many blocks of data of varying size. As many blocks are processed in parallel, memory usage needs to be kept to a reasonable level. Given n bytes of data, what would be the optimal dictionary size to use? Typical source blocks vary in size from 4 KB to 4 MB.

I speculate that there is no point in having a dictionary size larger than the number of bytes to compress. I also speculate that if the data were to compress to half its size, there would be no point in having a dictionary larger than n/2 bytes.

Of course, this is only speculation, and some insight as to why this is or is not the case would be greatly appreciated!

Cheers

John

2 Answers


There's probably no absolute optimum, as it depends on your specific needs. Compression algorithms (though I don't know about LZMA specifically) often let you adjust parameters to find the best trade-off between memory consumption, compression speed, and compression ratio. You will need to play with these parameters and see what effect they have on your actual workload. Most likely the default parameters are pretty good, and tweaking is only needed if your requirements are unusual, for example if you have hard memory or time constraints.
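As a minimal sketch of that kind of experiment (assuming Python's standard lzma module and one of your own blocks as input; the file name is just a placeholder), you can compare a few preset levels and see how speed and output size trade off on your real data:

    import lzma
    import time

    def benchmark_presets(data: bytes) -> None:
        """Compare a few LZMA presets on one representative block."""
        for preset in (0, 3, 6, 9):
            start = time.perf_counter()
            compressed = lzma.compress(data, preset=preset)
            elapsed = time.perf_counter() - start
            ratio = len(compressed) / len(data)
            print(f"preset={preset}: ratio={ratio:.3f}, time={elapsed:.3f}s")

    # Hypothetical usage with one of your own blocks:
    # benchmark_presets(open("sample_block.bin", "rb").read())

Higher presets also imply larger dictionaries, so memory use grows along with the preset level.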


A dictionary of size m is really just a window over the last m bytes of uncompressed data the encoder has seen. So for your usage, m := n is optimal for standalone LZMA compression: with only n bytes of input there are never more than n bytes of history to refer back to, so a larger dictionary only wastes memory.
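As a concrete illustration (a minimal sketch using Python's standard lzma module; the 4 KiB floor is liblzma's minimum dictionary size, and the function names are just placeholders), capping the LZMA2 dictionary at the block size looks like this:

    import lzma

    def compress_block(data: bytes) -> bytes:
        # A dictionary larger than the block cannot improve the ratio,
        # so clamp it to len(data); liblzma needs at least 4 KiB.
        dict_size = max(4096, len(data))
        filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
        return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

    def decompress_block(blob: bytes) -> bytes:
        # The .xz container records the filter chain, so the decompressor
        # needs no explicit dictionary size.
        return lzma.decompress(blob, format=lzma.FORMAT_XZ)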

If your blocks have similarities with one another, you can further improve the compression ratio by training LZMA with a sample block of size t that is known to both the compressor and the decompressor (search for "trained compression" on the Web for details). In that case, m := n + t would be ideal.
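Python's standard lzma module does not expose a preset/trained dictionary, so purely to illustrate the mechanism, here is the same idea using zlib's zdict parameter (a different codec, and the SAMPLE_BLOCK content is a hypothetical placeholder): both sides share a sample, and matches against it improve the ratio on small, similar blocks.

    import zlib

    # Hypothetical shared sample: a representative block agreed on ahead of
    # time by both the compressor and the decompressor.
    SAMPLE_BLOCK = b"...representative training data..."

    def compress_with_sample(data: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=SAMPLE_BLOCK)
        return c.compress(data) + c.flush()

    def decompress_with_sample(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=SAMPLE_BLOCK)
        return d.decompress(blob) + d.flush()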
