
I was wondering which of the following scenarios achieves the highest compression ratio when a lossless algorithm is applied to binary data containing repeated patterns.

Am I correct to assume that the compression ratio depends on the pattern's:

  1. Size
  2. Number of repetitions

For example, take the following binary data:

10 10 10 10 10 10 10 10: pattern (10) of size 2, repeated 8 times

1001 1001 1001 1001: pattern (1001) of size 4, repeated 4 times

00000000 11111111: pattern (0) of size 1, repeated 8 times, then pattern (1) of size 1, repeated 8 times. Or, viewed differently: pattern (00000000) of size 8, repeated once, then pattern (11111111) of size 8, repeated once.
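
To make this concrete, here is a sketch of how the ratios could be measured (assuming Python's zlib, which implements deflate, and assuming each 16-bit example is repeated many times and packed into bytes so the compressor has a realistically sized input):

    import zlib

    # The three examples as bit strings (names and repeat count are arbitrary).
    examples = {
        "10 repeated 8 times":    "10" * 8,
        "1001 repeated 4 times":  "1001" * 4,
        "eight 0s then eight 1s": "0" * 8 + "1" * 8,
    }

    REPEATS = 10_000  # the 16-bit examples alone are far too small to compress

    for name, bits in examples.items():
        stream = bits * REPEATS                                   # 160,000 bits
        data = int(stream, 2).to_bytes(len(stream) // 8, "big")   # pack into bytes
        ratio = len(data) / len(zlib.compress(data, 9))
        print(f"{name}: about {ratio:.0f}:1")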

Which of the above achieves the highest compression ratio, and which the lowest?

Thanks in advance.

chineerat
    Your first two examples should compress in the same way, if the algorithm is smart. (They are equivalent -- the first one could also be viewed as being a pattern of size 4 and repeated 4 times.) More generally, any pattern that is N length and repeats M times can be viewed as a pattern that is N*C length and repeats M/C times, for some constant C. – cdhowie Oct 08 '12 at 22:03
    Compression algorithms are very different. There must be dozens of LZ-style algorithms. Why are you asking? – usr Oct 08 '12 at 22:17

1 Answer


Those are all sequences that would be very unlikely to be seen in the wild. What is the point of the question?

Run-of-the-mill compressors are byte-oriented. As such, any pattern that results in simply the same byte repeated will give the highest compression ratio. E.g. 1032:1 in the limit for deflate. Other simple repetitions of short patterns will get very high compression ratios. E.g. again 1032:1 for deflate for patterns of two or three repeating bytes.
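
For example, the ratio for a run of a single repeated byte can be measured directly with Python's zlib (a minimal sketch; the input sizes are arbitrary):

    import zlib

    def deflate_ratio(data: bytes) -> float:
        """Uncompressed size divided by deflate-compressed size."""
        return len(data) / len(zlib.compress(data, 9))

    # The same byte repeated: the ratio climbs toward deflate's ~1032:1 limit
    # as the input grows.
    for n in (10_000, 1_000_000, 100_000_000):
        run = b"\x00" * n
        print(f"{n} zero bytes: about {deflate_ratio(run):.0f}:1")

    # A short repeating multi-byte pattern compresses almost as well.
    two_byte_pattern = b"\xaa\x99" * 500_000   # 1 MB of a repeating 2-byte pattern
    print(f"2-byte pattern: about {deflate_ratio(two_byte_pattern):.0f}:1")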

The limit on compression in these absurdly extreme cases is a function of the compression format, not of the data.

Mark Adler
  • Hi all! Thank you for your responses. The reason I asked is that I have an idea for an algorithm layer to apply before lossless compression. This is all just a concept; rigorous testing is yet to be done, not to mention a prototype. I was curious about the inputs to the LZW and Huffman lossless algorithms that ensure maximum compression. I have a flowchart of how I would like to apply the algorithm and its limits here: http://i46.tinypic.com/351vmll.png Your honest opinions? Feel free to poke holes. – chineerat Oct 14 '12 at 10:08
    You have some research to do. LZW is obsolete, and Huffman coding is only part of other schemes to model redundancy. Read about LZ77, the Burrows-Wheeler transform, prediction by partial matching, and arithmetic coding. You can also take a look at XML-WRT which is a text preprocessor applied to improve subsequent lossless compression. – Mark Adler Oct 14 '12 at 14:39
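
For reference, a quick way to compare those families on highly repetitive input (a sketch using the standard-library zlib and bz2 modules in Python; bz2 implements a Burrows-Wheeler-based compressor, and the test input is arbitrary):

    import bz2
    import zlib

    data = b"abcd" * 250_000   # 1 MB of a short repeating pattern

    for name, compress in (("deflate (zlib)", zlib.compress),
                           ("Burrows-Wheeler (bz2)", bz2.compress)):
        compressed = compress(data)
        print(f"{name}: about {len(data) / len(compressed):.0f}:1")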