
I was wondering which of the following scenarios achieves the highest compression ratio when a lossless algorithm is applied to binary data containing repeated patterns.

Am I correct to assume that the compression ratio depends on the pattern's:

  1. Size
  2. Number of repetitions

For example, take the following binary data:

10 10 10 10 10 10 10 10: pattern (10) of size 2, repeated 8 times

1001 1001 1001 1001: pattern (1001) of size 4, repeated 4 times

00000000 11111111: pattern (0) of size 1, repeated 8 times, then pattern (1) of size 1, repeated 8 times. Or, viewed differently: pattern (00000000) of size 8, repeated once, then pattern (11111111) of size 8, repeated once.
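
To make this concrete, here is a sketch of how the ratios could be measured (assuming Python's zlib, which implements deflate, and assuming each 16-bit example is repeated many times and packed into bytes so the compressor has a realistically sized input):

    import zlib

    # The three examples as bit strings (names and repeat count are arbitrary).
    examples = {
        "10 repeated 8 times":    "10" * 8,
        "1001 repeated 4 times":  "1001" * 4,
        "eight 0s then eight 1s": "0" * 8 + "1" * 8,
    }

    REPEATS = 10_000  # the 16-bit examples alone are far too small to compress

    for name, bits in examples.items():
        stream = bits * REPEATS                                   # 160,000 bits
        data = int(stream, 2).to_bytes(len(stream) // 8, "big")   # pack into bytes
        ratio = len(data) / len(zlib.compress(data, 9))
        print(f"{name}: about {ratio:.0f}:1")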

Which of the above achieves the highest compression ratio, and which the lowest?

Thanks in advance.

chineerat
    Your first two examples should compress in the same way, if the algorithm is smart. (They are equivalent -- the first one could also be viewed as being a pattern of size 4 and repeated 4 times.) More generally, any pattern that is N length and repeats M times can be viewed as a pattern that is N*C length and repeats M/C times, for some constant C. – cdhowie Oct 08 '12 at 22:03
    Compression algorithms are very different. There must be dozens of LZ-style algorithms. Why are you asking? – usr Oct 08 '12 at 22:17

1 Answer


Those are all sequences that would be very unlikely to be seen in the wild. What is the point of the question?

Run-of-the-mill compressors are byte-oriented. As such, any pattern that results in simply the same byte repeated will give the highest compression ratio. E.g. 1032:1 in the limit for deflate. Other simple repetitions of short patterns will get very high compression ratios. E.g. again 1032:1 for deflate for patterns of two or three repeating bytes.
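
For example, the ratio for a run of a single repeated byte can be measured directly with Python's zlib (a minimal sketch; the input sizes are arbitrary):

    import zlib

    def deflate_ratio(data: bytes) -> float:
        """Uncompressed size divided by deflate-compressed size."""
        return len(data) / len(zlib.compress(data, 9))

    # The same byte repeated: the ratio climbs toward deflate's ~1032:1 limit
    # as the input grows.
    for n in (10_000, 1_000_000, 100_000_000):
        run = b"\x00" * n
        print(f"{n} zero bytes: about {deflate_ratio(run):.0f}:1")

    # A short repeating multi-byte pattern compresses almost as well.
    two_byte_pattern = b"\xaa\x99" * 500_000   # 1 MB of a repeating 2-byte pattern
    print(f"2-byte pattern: about {deflate_ratio(two_byte_pattern):.0f}:1")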

The limit on compression in these absurdly extreme cases is a function of the compression format, not of the data.

Mark Adler
  • Hi all! Thank you for your responses. The reason I asked is that I have an idea for an algorithm layer to apply before lossless compression. This is all just a concept; rigorous testing is yet to be done, not to mention a prototype. I was curious about the inputs to the LZW and Huffman lossless algorithms that ensure maximum compression. I have a flowchart of how I would like to apply the algorithm and its limits here: http://i46.tinypic.com/351vmll.png Your honest opinions? Feel free to poke holes. – chineerat Oct 14 '12 at 10:08
    You have some research to do. LZW is obsolete, and Huffman coding is only part of other schemes to model redundancy. Read about LZ77, the Burrows-Wheeler transform, prediction by partial matching, and arithmetic coding. You can also take a look at XML-WRT which is a text preprocessor applied to improve subsequent lossless compression. – Mark Adler Oct 14 '12 at 14:39
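
For reference, a quick way to compare those families on highly repetitive input (a sketch using the standard-library zlib and bz2 modules in Python; bz2 implements a Burrows-Wheeler-based compressor, and the test input is arbitrary):

    import bz2
    import zlib

    data = b"abcd" * 250_000   # 1 MB of a short repeating pattern

    for name, compress in (("deflate (zlib)", zlib.compress),
                           ("Burrows-Wheeler (bz2)", bz2.compress)):
        compressed = compress(data)
        print(f"{name}: about {len(data) / len(compressed):.0f}:1")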