I wanted to know if there is an optimal way to compress a CSV file that has millions of rows which are duplicated sequentially. Each row has 4-5 elements. There are only a few hundred unique rows, but because each of them appears so many times, the overall size of the file is large.
I am not familiar with the detailed algorithms used in tools such as gzip, bzip2, etc., but I was wondering whether there is any way to hint gzip or bzip2 about this pattern. For example, if I had 1 million identical rows of a,b,c,d,e, then internally this could be represented optimally as a single entry for abcde plus a count of the number of times it is repeated (e.g. abcde repeated 1 million times). This would be more compact than having the compression algorithm try to compress abcdeabcdeabcde... . I am looking for a general-purpose way to optimise cases such as these, where the data is in sorted tabular form and contains duplicated rows/tuples.
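To make the idea concrete, this rough Python sketch is the kind of pre-processing step I have in mind (the file names data.csv and data.rle.csv.gz are just placeholders): it run-length encodes consecutive duplicate rows into a (row, count) form before handing the result to gzip.

```python
import csv
import gzip
from itertools import groupby

SRC = "data.csv"          # placeholder input file
DST = "data.rle.csv.gz"   # placeholder output file

# Collapse runs of identical consecutive rows into "row + count",
# then gzip the (much smaller) result.
with open(SRC, newline="") as fin, gzip.open(DST, "wt", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row, group in groupby(reader, key=tuple):
        writer.writerow(list(row) + [sum(1 for _ in group)])
```

Decompression would then simply expand each count back into that many repeated rows. My question is whether there is a standard tool or option that achieves this kind of optimisation directly, rather than me rolling my own pre/post-processing step.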
Thanks in advance.