
I wanted to know if there is an optimal way to compress a CSV file that has millions of rows which are duplicated sequentially. Each row has 4-5 fields. There are only a few hundred unique rows, but because each of them appears so many times, the overall size of the file is large.

I am not familiar with the detailed algorithms used in tools such as gzip, bzip2, etc., but I was wondering whether there is any way to instruct gzip or bzip2 about this pattern. For example, if I had 1 million rows of a,b,c,d,e, then internally this could be represented optimally as an entry for abcde and a count of the number of times it is repeated (e.g. abcde repeated 1 million times). This would be more efficient than having the compression algorithm try to compress abcdeabcdeabcde... . I am looking for a general-purpose way to optimise cases such as these, where the data is in sorted tabular format and contains duplicated rows/tuples.
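To illustrate, what I have in mind is essentially run-length encoding at the row level. A rough Python sketch of the representation I am describing (`data.csv` is just a placeholder name for my file):

```python
import csv
from itertools import groupby

# Collapse consecutive duplicate rows into (row, count) pairs.
# For data like mine this yields a few hundred pairs instead of
# millions of rows.
with open("data.csv", newline="") as f:
    collapsed = [(row, sum(1 for _ in group))
                 for row, group in groupby(tuple(r) for r in csv.reader(f))]

for row, count in collapsed[:5]:
    print(",".join(row), "repeated", count, "times")
```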

Thanks in advance.

xbsd

1 Answer


You should create your own custom format. Something like:

0 -> end of file
1 -> row follows (self-terminating with an end-of-line)
2..n -> repeat the previous row that many times

The number can be a variable-length integer, where a zero high bit in a byte indicates the end of the integer, and a one indicates that more bytes follow. The low seven bits of each byte are then concatenated to make the integer. So small repeat counts (< 128) take only one byte; larger ones take more bytes. You can concatenate them either least-significant first or most-significant first, as long as you're consistent on both ends.
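A minimal sketch of that variable-length integer in Python, least-significant seven bits first (the function names are just for illustration):

```python
def put_varint(n, out):
    # Emit n as 7-bit groups, least significant first.  The high bit is
    # set on every byte except the last, whose clear high bit marks the
    # end of the integer.
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)   # more bytes follow
        else:
            out.append(b)          # last byte
            return

def get_varint(data, pos):
    # Inverse of put_varint: returns (value, position after the integer).
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7f) << shift
        shift += 7
        if not (b & 0x80):
            return n, pos
```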

Once you have removed the repeated rows in this way, then compress with gzip.
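Putting it together, here is one possible encoder for that format, as a sketch rather than a definitive implementation. It reuses put_varint from above and reads the 2..n token as "the previous row occurs that many additional times", so a run of exactly two identical rows is simply written out twice:

```python
import gzip
from itertools import groupby

def encode(lines):
    # lines: an iterable of newline-terminated text rows.
    out = bytearray()
    for row, group in groupby(lines):
        count = sum(1 for _ in group)
        out.append(1)                  # 1 -> a literal row follows
        out += row.encode()            # the row itself, ending in "\n"
        extra = count - 1
        if extra == 1:
            out.append(1)              # one extra copy: just write it again
            out += row.encode()
        elif extra >= 2:
            put_varint(extra, out)     # 2..n -> repeat the previous row n times
    out.append(0)                      # 0 -> end of file
    return gzip.compress(bytes(out))

with open("data.csv") as f:
    with open("data.rle.gz", "wb") as out:
        out.write(encode(f))
```

gzip then takes care of the redundancy among the few hundred distinct rows that remain.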

Mark Adler