
I have lots of huge text-files which need to be compressed with the highest ratio possible. Compression speed may be slow, as long as decompression is reasonably fast.

Each line in these files contains one dataset, and they can be stored in any order.

A similar problem to this one: Sorting a file to optimize for compression efficiency

But for me compression speed is not an issue. Are there ready-to-use tools to group similar lines together? Or maybe just an algorithm I can implement?

Sorting alone gave some improvement, but I suspect a lot more is possible.
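For reference, the sort-then-compress baseline can be reproduced with standard tools (GNU sort and xz; the specific flag values below are illustrative tuning knobs, not recommendations from this question):

```shell
# Sort so that similar lines become adjacent, then compress aggressively.
# --parallel and -S (sort memory) speed up sorting large files;
# -9e is xz's strongest (and slowest) preset.
sort --parallel=4 -S 4G input.txt | xz -9e > input.sorted.xz
```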

Each file is about 600 million lines long, ~40 bytes per line, 24 GB in total; xz compresses this to ~10 GB.

Craden
    I suppose that if the edit distance between subsequent strings is minimal, your compression ratio will be fairly good... For example, try [this way](https://stackoverflow.com/a/29155276/7879193). And, of course, experiment with different compression algorithms... – Stanislav Kralin Jun 28 '17 at 20:31
  • Thank you for this example - I'm currently experimenting with k-means clustering, with quite promising results. – Craden Jun 29 '17 at 08:13
  •
    Can you rearrange the order of the fields within each line? (assuming all lines would have the same order) – samgak Jun 29 '17 at 23:51
  • The fields within each line could be rearranged, but I don't think it would help, as they are all very similar. – Craden Jun 30 '17 at 12:38
  • `ready-to-use tools to group similar` I'd expect BWT to catch most of the possible benefit. As of 2017, 24 GB is above the easily handled data window size for most machines, and not all formats/utilities/algorithms support large windows. – greybeard Jul 02 '17 at 02:38
  • `experimenting with k-means-clustering with quite promising results` Please report method and results of such experiments in the question to guide what not to bother to recommend. – greybeard Jul 02 '17 at 02:40

1 Answer


Here's a fairly naïve algorithm:

  • Choose an initial line at random and write it to the compression stream.
  • While remaining lines > 0:
    • Save the state of the compression stream.
    • For each remaining line in the text file:
      • Write the line to the compression stream and record the resulting compressed length.
      • Roll back to the saved state of the compression stream.
    • Write the line that resulted in the lowest compressed length to the compression stream, and remove it from the remaining lines.
    • Free the saved state.

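In Python, the rollback step maps naturally onto `zlib`'s ability to duplicate a compressor's state: `Compress.copy()` is the counterpart of zlib's `deflateCopy`. The sketch below (my own illustration of the steps above, not code from the answer) is hopeless at 600 million lines but usable for experiments on small samples:

```python
import zlib

def greedy_order(lines):
    """Greedily order byte-string lines so that each appended line is the
    one that adds the fewest compressed bytes to the stream so far.
    O(n^2) compressor calls - a sketch for small samples only."""
    remaining = list(lines)
    ordered = [remaining.pop(0)]  # first line stands in for a random choice
    comp = zlib.compressobj(9)
    comp.compress(ordered[0])
    while remaining:
        best_i, best_len = 0, None
        for i, line in enumerate(remaining):
            trial = comp.copy()  # duplicate the stream state (deflateCopy)
            # Z_SYNC_FLUSH forces buffered output so lengths are comparable.
            n = len(trial.compress(line)) + len(trial.flush(zlib.Z_SYNC_FLUSH))
            if best_len is None or n < best_len:
                best_i, best_len = i, n
        line = remaining.pop(best_i)
        ordered.append(line)
        comp.compress(line)  # advance the real stream past the chosen line
    return ordered
```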
This is a greedy algorithm and won't be globally optimal, but it should be pretty good at pairing up lines that compress well when they follow one another. It's O(n²), but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well but actually measures it.

If you use zlib, it provides a function `deflateCopy` that duplicates the state of the compression stream, although it's apparently pretty expensive.

Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between all adjacent pairs of lines in the sequence, then it reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and all your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them to this. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest.
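As an illustration, the simplest TSP heuristic, nearest neighbour, is easy to sketch with edit distance as the metric. The Levenshtein implementation and the function names below are my own assumptions, not part of the answer; it still needs O(n²) distance evaluations, so it only fits samples or pre-clustered groups:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def nearest_neighbour_order(lines):
    """Order lines by repeatedly hopping to the closest remaining line -
    the standard greedy TSP heuristic, with edit distance as the metric."""
    remaining = list(lines)
    path = [remaining.pop(0)]  # arbitrary starting node
    while remaining:
        i = min(range(len(remaining)),
                key=lambda k: levenshtein(path[-1], remaining[k]))
        path.append(remaining.pop(i))
    return path
```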

samgak
  • I experimented with this solution and ran into two problems. First, the size of the output stream is not always smaller when the current line compresses better: although the lines are ~40 bytes each, it seems gzip and xz need more data before the difference shows up in the output size. Second is runtime: the files are huge - 600 million lines each - so O(n²) would take years to finish. – Craden Jun 29 '17 at 08:10
  • You should probably add those stats (600 million lines at ~40 bytes each) to the question, as it's relevant information. – samgak Jun 29 '17 at 23:49