
I wanted to know if there is an optimal way to compress a CSV file that has millions of rows which are duplicated sequentially. Each row has 4-5 fields. There are only a few hundred unique rows, but because each of them appears so many times, the overall size of the file is large.

I am not familiar with the detailed algorithms used in tools such as gzip, bzip2, etc., but I was wondering whether there is any way to instruct gzip or bzip2 about this pattern. For example, if I had 1 million rows of a,b,c,d,e, then internally this could be represented optimally as an entry for abcde and a count of the number of times it is repeated (e.g. abcde repeated 1 million times). This would be more efficient than having the compression algorithm try to compress abcdeabcdeabcde... . I am looking for a general-purpose way to optimise cases such as these, where the data is in sorted tabular format and contains duplicated rows/tuples.
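To illustrate, what I have in mind is essentially run-length encoding at the row level. A rough Python sketch of the representation I am describing (`data.csv` is just a placeholder name for my file):

```python
import csv
from itertools import groupby

# Collapse consecutive duplicate rows into (row, count) pairs.
# For data like mine this yields a few hundred pairs instead of
# millions of rows.
with open("data.csv", newline="") as f:
    collapsed = [(row, sum(1 for _ in group))
                 for row, group in groupby(tuple(r) for r in csv.reader(f))]

for row, count in collapsed[:5]:
    print(",".join(row), "repeated", count, "times")
```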

Thanks in advance.

xbsd

1 Answer


You should create your own custom format. Something like:

0 -> end of file
1 -> row follows (self-terminating with an end-of-line)
2..n -> repeat the previous row that many times

The number can be a variable-length integer, where a zero high bit in a byte indicates the end of the integer, and a one indicates that more bytes follow. The low seven bits of each byte are then concatenated to make the integer. So small repeat counts (< 128) take only one byte; larger ones take more bytes. You can concatenate them either least-significant first or most-significant first, as long as you're consistent on both ends.
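A minimal sketch of that variable-length integer in Python, least-significant seven bits first (the function names are just for illustration):

```python
def put_varint(n, out):
    # Emit n as 7-bit groups, least significant first.  The high bit is
    # set on every byte except the last, whose clear high bit marks the
    # end of the integer.
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)   # more bytes follow
        else:
            out.append(b)          # last byte
            return

def get_varint(data, pos):
    # Inverse of put_varint: returns (value, position after the integer).
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7f) << shift
        shift += 7
        if not (b & 0x80):
            return n, pos
```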

Once you have removed the repeated rows in this way, then compress with gzip.
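Putting it together, here is one possible encoder for that format, as a sketch rather than a definitive implementation. It reuses put_varint from above and reads the 2..n token as "the previous row occurs that many additional times", so a run of exactly two identical rows is simply written out twice:

```python
import gzip
from itertools import groupby

def encode(lines):
    # lines: an iterable of newline-terminated text rows.
    out = bytearray()
    for row, group in groupby(lines):
        count = sum(1 for _ in group)
        out.append(1)                  # 1 -> a literal row follows
        out += row.encode()            # the row itself, ending in "\n"
        extra = count - 1
        if extra == 1:
            out.append(1)              # one extra copy: just write it again
            out += row.encode()
        elif extra >= 2:
            put_varint(extra, out)     # 2..n -> repeat the previous row n times
    out.append(0)                      # 0 -> end of file
    return gzip.compress(bytes(out))

with open("data.csv") as f:
    with open("data.rle.gz", "wb") as out:
        out.write(encode(f))
```

gzip then takes care of the redundancy among the few hundred distinct rows that remain.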

Mark Adler