I want to compress .txt files that contain dates in yyyy-mm-dd hh:mm:ss format and English words that tend to be repeated across different lines.
I read some articles about compression algorithms and found out that in my case dictionary-based encoding is better than entropy-based encoding. Since I want to implement the algorithm myself, I need something that isn't very complicated. So I looked at LZW and LZ77, but I can't choose between them, because the conclusions of the articles I found are contradictory: according to some, LZW has the better compression ratio, and according to others, LZ77 is the leader. So the question is: which one is most likely to be better in my case? Are there easier-to-implement algorithms that would be good for my purpose?

Okumo
  • 2
    Experiment with readily accessible implementations. Does each file have to be decompressible individually? *Time stamps and words* looks a bit like *log files* - look for special solutions. Experiment with converting the time stamps to a more compact representation: 32 bits of seconds cover more than 136 years. – greybeard Mar 26 '19 at 22:27
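
The timestamp suggestion in the comment above can be sketched roughly as follows. This is a minimal illustration, not a tested format. It assumes the timestamps are UTC and fall between 1970 and early 2106, so that the second count fits in an unsigned 32-bit integer. The function names `pack_timestamp` and `unpack_timestamp` are hypothetical.

```python
import calendar
import struct
import time

FORMAT = "%Y-%m-%d %H:%M:%S"

def pack_timestamp(text: str) -> bytes:
    """Pack a 19-character 'yyyy-mm-dd hh:mm:ss' string into 4 bytes.

    Assumption: the timestamp is UTC and not earlier than 1970-01-01;
    struct.pack raises struct.error for values outside 0..2**32-1.
    """
    parsed = time.strptime(text, FORMAT)
    seconds = calendar.timegm(parsed)   # seconds since 1970-01-01 UTC
    return struct.pack(">I", seconds)   # unsigned 32-bit, big-endian

def unpack_timestamp(blob: bytes) -> str:
    """Inverse of pack_timestamp."""
    (seconds,) = struct.unpack(">I", blob)
    return time.strftime(FORMAT, time.gmtime(seconds))

blob = pack_timestamp("2019-03-26 22:27:00")
print(len(blob), unpack_timestamp(blob))
```

Each timestamp shrinks from 19 bytes of text to 4 bytes before any general-purpose compressor runs. Whether this helps overall depends on the data, so it is worth measuring with and without the conversion.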

1 Answer

5

LZW is obsolete. Modern, and even pretty old, LZ77 compressors outperform LZW.

In any case, you are the only one who can answer your question, since only you have examples of the data you want to compress. Simply experiment with various compression methods (zstd, xz, lz4, etc.) on your data and see what combination of compression ratio and speed meets your needs.
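
One way to run such an experiment is sketched below. Note the substitution: zstd and lz4 need third-party packages or their command-line tools, so this sketch uses the codecs in Python's standard library instead (`zlib`, which is DEFLATE, i.e. LZ77 plus Huffman coding; `bz2`; and `lzma`, the algorithm behind xz). The sample data and the helper name `compare_compressors` are made up for illustration. Real files should replace the sample.

```python
import bz2
import lzma
import zlib

def compare_compressors(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each stdlib codec."""
    codecs = {
        "zlib": lambda d: zlib.compress(d, 9),
        "bz2": lambda d: bz2.compress(d, 9),
        "lzma": lambda d: lzma.compress(d),
    }
    return {name: len(compress(data)) for name, compress in codecs.items()}

# Hypothetical sample resembling the data described in the question:
# a timestamp followed by repeated English words on every line.
sample = b"".join(
    b"2019-02-11 11:%02d:%02d user logged in\n" % (i // 60, i % 60)
    for i in range(600)
)

for name, size in compare_compressors(sample).items():
    print(f"{name}: {len(sample)} -> {size} bytes")
```

This only measures compression ratio. To compare speed as well, each call could be wrapped with `time.perf_counter()`.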

Mark Adler
  • Aren't the new algorithms you mentioned too hard to implement? – Okumo Feb 11 '19 at 11:48
  • And do they have a chance of surpassing gzip on my type of data? – Okumo Feb 11 '19 at 12:01
  • 3
    Yes, on either speed or compression. Just try them. – Mark Adler Feb 11 '19 at 15:15
  • Thank you. Yep, I will. But implementing all of them would be a bit too much, so I have to choose. Can you please tell me which of the methods you mentioned are hard to implement and which are not? – Okumo Feb 11 '19 at 17:46
  • 2
    They are already written for you. You don't need to implement the compressors. Just download them and try them. – Mark Adler Feb 11 '19 at 18:36
  • That's fine for finding out which one will be better in my case, but as I mentioned in the question, I want to implement the chosen algorithm myself. So there is no point in considering and testing complicated methods. That's why my previous question still stands. – Okumo Feb 11 '19 at 19:09
  • 1
    So test with the already-written compressors, decide which one fits your needs, and _then_ write that one yourself. – Mark Adler Feb 11 '19 at 19:59
  • Yes, that's exactly what I wanted to do. But I don't want to struggle too much while writing my code, so I need to exclude complicated algorithms from the checklist. Since you mentioned some of them, I thought you could point out how complicated they are. Is that possible? – Okumo Feb 11 '19 at 20:09
  • 1
    Sigh. I give up. – Mark Adler Feb 11 '19 at 23:17
  • Looks like there is some misunderstanding. I clearly understand that I can run tests using already-implemented algorithms. After that I can implement only the best one, BUT I still don't want it to be super complicated. That's why I asked you to point out the difficult ones, so that I won't even bother with them. If you are not sure about the answer, that's fine; the answer you already gave will be accepted. – Okumo Feb 11 '19 at 23:55
  • I implemented LZW and LZ77 in Crystal for fun and found that LZW is usually worse. The most natural implementation of LZ77 is LZSS -- that's what you'll come up with if you try to implement LZ77 as simply as possible and without wasting bytes needlessly (e.g. using 2 bytes to encode an offset to a 2 byte match is wasteful, so the match needs to be at least 3 bytes long). I think LZ4 is very closely related to LZ77/LZSS. They're all quite simple to implement the encoder/decoder and I'd encourage anyone to try it. – Desty Jul 30 '23 at 05:26
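
To give a sense of how small an LZSS-style codec can be, here is a minimal sketch along the lines of the last comment. The byte layout is an assumption made for this example, not a standard format: one flag byte per group of up to eight tokens, a set bit meaning a literal byte and a clear bit meaning a 2-byte match made of a 12-bit offset and a 4-bit length. The match search is brute force, so it is only suitable for small inputs.

```python
WINDOW = 4096               # offsets 1..4096 fit in 12 bits
MIN_MATCH = 3               # a match costs 2 bytes, so shorter ones don't pay
MAX_MATCH = MIN_MATCH + 15  # length is stored in 4 bits

def lzss_compress(data: bytes) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(data):
        flag_index = len(out)
        out.append(0)                      # placeholder for the flag byte
        for bit in range(8):
            if pos >= len(data):
                break
            best_len, best_off = 0, 0
            limit = min(MAX_MATCH, len(data) - pos)
            # Brute-force search of the sliding window for the longest match.
            for cand in range(max(0, pos - WINDOW), pos):
                length = 0
                while length < limit and data[cand + length] == data[pos + length]:
                    length += 1
                if length > best_len:
                    best_len, best_off = length, pos - cand
            if best_len >= MIN_MATCH:
                token = ((best_off - 1) << 4) | (best_len - MIN_MATCH)
                out += token.to_bytes(2, "big")
                pos += best_len
            else:
                out[flag_index] |= 1 << bit  # mark this token as a literal
                out.append(data[pos])
                pos += 1
    return bytes(out)

def lzss_decompress(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        flags = blob[i]
        i += 1
        for bit in range(8):
            if i >= len(blob):
                break
            if flags & (1 << bit):
                out.append(blob[i])
                i += 1
            else:
                token = int.from_bytes(blob[i:i + 2], "big")
                i += 2
                offset = (token >> 4) + 1
                length = (token & 0xF) + MIN_MATCH
                for _ in range(length):
                    out.append(out[-offset])  # byte-wise copy handles overlap
    return bytes(out)

text = b"2019-02-11 11:48:00 connection lost\n" * 20
packed = lzss_compress(text)
print(len(text), "->", len(packed), lzss_decompress(packed) == text)
```

A practical encoder would replace the brute-force search with a hash table or similar index over the window. The decoder stays as simple as shown.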