4

We have some large data files that are being concatenated, compressed, and then sent to another server. The compression reduces the transmission time to the destination server, so the smaller we can make the file within a short amount of time, the better. This is a highly time-sensitive process.

The data files contain many rows of tab-delimited text, and the order of the rows does not matter.

We noticed that when we sorted the file by the first field, the compressed file was much smaller, presumably because duplicate values in that column end up next to each other. However, sorting a large file is slow, and there's no real reason for it to be sorted other than that sorting happens to improve compression. There's also no relationship between what's in the first column and what's in the subsequent columns. There could be some ordering of rows that compresses even smaller, or alternatively there could be an algorithm that improves compression about as much but takes less time to run.

What approach could I use to reorder rows to optimize the similarity between neighboring rows and improve compression performance?
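
For reference, here is a minimal sketch of the kind of comparison that led to this question, assuming gzip and a hypothetical `data.tsv` path (the real pipeline concatenates several files first):

```python
import gzip

# Hypothetical path; the real pipeline concatenates several files first.
PATH = "data.tsv"

with open(PATH, "rb") as f:
    lines = f.readlines()

as_is = gzip.compress(b"".join(lines))

# Sort by the first tab-delimited field, mirroring the experiment described above.
lines.sort(key=lambda line: line.split(b"\t", 1)[0])
sorted_by_first_field = gzip.compress(b"".join(lines))

print(len(as_is), len(sorted_by_first_field))
```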

heyitsbmo
  • 1,715
  • 1
  • 14
  • 29
  • 2
    It is possible that you simply need a larger dictionary size. If resorting the file improves compression, that seems to indicate that the dictionary is so short that by the time the next identical value comes around, the compression algorithm has forgotten about the previous value. Most compression algorithms allow you to change the size of the dictionary used to remember these values. – HugoRune Jun 10 '14 at 20:13
  • 1
    Previously posted as an answer, but actually much too broad: you can try [clustering](https://en.wikipedia.org/wiki/Cluster_analysis) your data and then grouping by the clusters. Like compression itself, clustering is a hard problem usually tackled by heuristics. – Fred Foo Jun 10 '14 at 20:17
  • 1
    What compression algorithm are you using? – Gumbo Jun 10 '14 at 20:17
  • @Gumbo right now, gzip – heyitsbmo Jun 10 '14 at 20:20
  • 3
    Try bzip2. That does a Burrows-Wheeler pass before the LZ pass, essentially doing the sort for you. – moonshadow Jun 10 '14 at 20:21
  • I can confirm this. We use LZMA and it is multiple times smaller than zip for our SQLite db. Your mileage may vary, though – benjist Jun 22 '14 at 23:38

3 Answers

1

Here are a few suggestions:

  1. Split the file into smaller batches and sort those. Sorting multiple small sets of data is faster than sorting a single big chunk. You can also easily parallelize the work this way.
  2. Experiment with different compression algorithms. Different algorithms have different throughput and ratio. You are interested in algorithms that are on the pareto frontier of those two dimensions.
  3. Use bigger dictionary sizes. This allows the compressor to reference data that is further in the past.

Note that sorting is important no matter which algorithm and dictionary size you choose, because references to older data tend to cost more bits. Also, sorting by a time dimension tends to group together rows that come from a similar data distribution. For example, Stack Overflow gets more bot traffic at night than during the day, so the distribution of UserAgent values in their HTTP logs probably varies quite a bit with the time of day. A rough sketch combining suggestions 1 and 3 is below.
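
Here is a rough sketch of suggestions 1 and 3 combined, assuming Python's standard-library `lzma` module; the chunk size, dictionary size, and file names are placeholder assumptions, not tuned recommendations:

```python
import lzma
import multiprocessing

CHUNK_LINES = 1_000_000  # placeholder batch size; tune to what fits in memory

# 64 MiB dictionary so the compressor can reference data much further back than
# gzip's 32 KiB window; the exact value is an assumption, not a measured optimum.
FILTERS = [{"id": lzma.FILTER_LZMA2, "dict_size": 64 * 1024 * 1024}]

def sort_and_compress(lines):
    """Sort one batch of rows by the first field and compress it independently."""
    lines.sort(key=lambda line: line.split(b"\t", 1)[0])
    return lzma.compress(b"".join(lines), format=lzma.FORMAT_XZ, filters=FILTERS)

def batches(path):
    """Yield the file in batches of CHUNK_LINES lines."""
    with open(path, "rb") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= CHUNK_LINES:
                yield batch
                batch = []
        if batch:
            yield batch

if __name__ == "__main__":
    # Each compressed batch is written back to back; the xz tool can decompress
    # concatenated .xz streams, so the receiver just runs `xz -d` on the file.
    with multiprocessing.Pool() as pool, open("data.tsv.xz", "wb") as out:
        for blob in pool.imap(sort_and_compress, batches("data.tsv")):
            out.write(blob)
```

Because the batches are compressed independently, the work parallelizes across cores without changing anything on the receiving side.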

Russell Reed
  • 424
  • 2
  • 8
usr
  • 168,620
  • 35
  • 240
  • 369
  • Splitting the file into pieces that can be sorted in memory and sorting those is much more efficient than sorting the whole concatenated file and gets nearly the same compression efficiency. Thanks! – heyitsbmo Jun 11 '14 at 18:57
0

If the columns contain different types of data, e.g.

Name, Favourite drink, Favourite language, Favourite algorithm

then you may find that transposing the data (i.e. turning rows into columns) improves compression, because for each new item the zip algorithm only needs to encode which item is the favourite, rather than both which item and which category.

On the other hand, if a word is equally likely to appear in any column, then this approach is unlikely to be of any use.
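
A minimal sketch of that comparison, assuming Python, gzip, and rows with equal column counts in a hypothetical `data.tsv`:

```python
import gzip

# Hypothetical tab-delimited input; assumes every row has the same column count.
with open("data.tsv", "r", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

columns = list(zip(*rows))  # transpose: row-major -> column-major

row_major = "\n".join("\t".join(r) for r in rows).encode()
col_major = "\n".join("\t".join(c) for c in columns).encode()

# Compare compressed sizes; whether the transpose wins depends entirely on
# whether each column really has its own narrow vocabulary.
print(len(gzip.compress(row_major)), len(gzip.compress(col_major)))
```

Keep in mind that the receiver would need to transpose the data back after decompressing, so the transposition has to be cheap relative to the transfer-time savings.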

Peter de Rivaz
  • 33,126
  • 4
  • 46
  • 75
-1

Just one more suggestion: simply try using a different compression format. We found for our application (a compressed SQLite db) that LZMA / 7z compresses about 4 times better than zip. Just saying, before you implement anything.
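
If you want a quick way to check this on your own data before committing to anything, here is a minimal sketch using only Python's standard library (the path is a placeholder):

```python
import bz2, gzip, lzma

# Placeholder path; run this against a representative sample of the real data.
with open("data.tsv", "rb") as f:
    data = f.read()

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress), ("xz/LZMA", lzma.compress)]:
    print(f"{name:8s} {len(compress(data)):>12,d} bytes")
```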

benjist
  • 2,740
  • 3
  • 31
  • 58