
This is a theoretical question, so expect that many details here are not computable in practice or even in theory.

Let's say I have a string s that I want to compress. The result should be a self-extracting binary (it could be x86 assembly, or some other hypothetical Turing-complete low-level language) which outputs s.

Now, we can easily enumerate all such binaries, ordered by size. Let B_s be the sub-list of those binaries that output s (of course, B_s is uncomputable).

As every non-empty set of positive integers has a minimum, there must be a smallest program b_min_s in B_s.

For what languages (i.e. sets of strings) do we know something about the size of b_min_s? Even just an estimate would be interesting. (I can construct some trivial examples where I can even calculate B_s and b_min_s, but I am interested in more interesting languages.)
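To make the "trivial examples" concrete: for a toy, non-Turing-complete language the search actually is computable. The sketch below invents a two-instruction language (a bare letter prints itself; a digit d followed by a letter prints that letter d times) and finds b_min_s by brute-force enumeration ordered by size — the instruction set and alphabet are made up purely for illustration:

```python
from itertools import product

ALPHABET = "abc123"  # toy instruction set, invented for this sketch

def run(prog):
    """Interpret the toy language: a digit d followed by a letter
    emits that letter d times; a bare letter emits itself once."""
    out, i = [], 0
    while i < len(prog):
        ch = prog[i]
        if ch.isdigit():
            if i + 1 >= len(prog) or not prog[i + 1].isalpha():
                return None  # malformed program
            out.append(prog[i + 1] * int(ch))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def shortest_program(s, max_len=6):
    # Enumerate programs ordered by size; the first hit is b_min_s.
    # Computable only because the toy language always terminates.
    for n in range(1, max_len + 1):
        for prog in map("".join, product(ALPHABET, repeat=n)):
            if run(prog) == s:
                return prog
    return None

print(shortest_program("aaa"))  # → "3a"
```

For a Turing-complete language the same enumeration fails: some candidate programs never halt, so the inner check cannot be decided.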

Albert
  • I recall some very clever programs from the olden days, such as bootstrap loaders that overwrote themselves several times. Likely, to achieve minimal overall size of the self-extracting program, the program could make use of its own text somehow -- e.g., as a source of constants. – Hot Licks Sep 10 '11 at 01:37

4 Answers


This is Kolmogorov complexity, and you are correct that it's not computable. If it were, you could create a paradoxical program of length less than n that searched for and printed a string with Kolmogorov complexity m > n.
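The contradiction can be sketched as code. Suppose the oracle K below were a real, computable function: then for a hard-coded n the whole program is only about c + log2(n) bits long, yet it prints a string of complexity greater than n. No such computable K exists, so here it is passed in as a parameter and the search logic can only be exercised with a stand-in measure:

```python
from itertools import count, product

def first_complex_string(n, K):
    """Return the first bitstring, in length-then-lexicographic order,
    whose complexity according to the oracle K exceeds n."""
    for length in count(1):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if K(s) > n:
                return s

# The oracle is hypothetical; as a stand-in we can plug in a
# computable "complexity" such as plain string length:
print(first_complex_string(2, len))  # → "000"
```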

Clearly, you can bound b_min_s from above for given inputs: any program that outputs s gives such a bound. However, as far as I know, most of the efforts to do so have been empirical rather than analytic. For instance, there is an ongoing competition, the Hutter Prize, to compress English Wikipedia.

Matthew Flaschen
  • Yea, exactly that prize drove me to this question. :) However, such competitions/attempts only give indications, because they show upper bounds for a particular example string. They don't give any answer about an average / real hard limit for a given language (e.g. XML with grammatically correct English as content). – Albert Jul 16 '10 at 03:50
  • Here is some good compression explanation I'd recommend for further reading: http://www.mattmahoney.net/dc/dce.html - and on the Hutter page, there's a link to http://cs.fit.edu/~mmahoney/compression/textdata.html which is nice to read, too. – schnaader Jul 16 '10 at 06:47

Claude Shannon estimated the information density of the English language to be somewhere between 0.6 and 1.3 bits per character in his 1951 paper Prediction and Entropy of Printed English (Bell Syst. Tech. J. 30, pp. 50–64).
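Shannon's figure accounts for long-range statistical structure (digrams, words, grammar). A zeroth-order estimate based only on single-character frequencies, which ignores all of that, comes out much higher — roughly 4 bits per character for ordinary English. A small sketch (the sample text is arbitrary):

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """Zeroth-order entropy in bits per character, computed from
    single-character frequencies only (ignores digrams, words, grammar)."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 10
print(round(char_entropy(sample), 2))
```

The gap between this naive figure and Shannon's 0.6–1.3 (obtained by having human subjects predict the next character) is exactly the redundancy that a good compressor exploits.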

phreeza
  • Hm, I wonder if Kolmogorov complexity is compatible with Shannon's information density. From my intuition, Shannon information is just a stream of bits. E.g. the pixel stream of a fractal image still has high information density by Shannon's definition. So under these considerations, I wonder if 0.6 is really a good estimate. Maybe for English text which doesn't contain any redundant information. – Albert Jul 19 '10 at 00:18
  • Shannon information makes a statement about the general statistical case, while Kolmogorov complexity is the information content of a single object. So in this example, Shannon information says something about the average character in an English text, while the Kolmogorov complexity is the information content of a specific body of text, e.g. your string s. – phreeza Jul 19 '10 at 10:34
  • But Shannon was a major formative figure in "information theory" and entropy, and ultimately it's entropy that is the issue here. ["Shannon's entropy represents an absolute limit on the best possible lossless compression of any communication"](http://en.wikipedia.org/wiki/Entropy_%28information_theory%29) – Hot Licks Sep 10 '11 at 01:34

The maximal (average) compression rate possible is 1:1.
The number of possible inputs must equal the number of possible outputs, or you could not map the output back to the input.
So to store the outputs you need, on average, containers of the same size as the minimal containers for the inputs -- giving a 1:1 compression rate.
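This counting argument can be made quantitative. Even an ideal lossless code that maps the 2^n equally likely inputs of length n onto the 2^n shortest distinct bitstrings saves fewer than two bits on average, independent of n — a sketch:

```python
def best_average_output_length(n):
    """Best possible average output length of a lossless code that
    assigns the 2**n inputs of length n to the 2**n shortest
    distinct bitstrings (injectivity forbids doing any better)."""
    remaining = 2 ** n
    total_bits = 0
    length = 0
    while remaining:
        take = min(remaining, 2 ** length)  # 2**length codes of this length exist
        total_bits += take * length
        remaining -= take
        length += 1
    return total_bits / 2 ** n

for n in (8, 16, 24):
    print(n, best_average_output_length(n))  # average saving stays below 2 bits
```

Real compressors beat 1:1 on real data only because real data is not uniformly distributed; averaged over all inputs, the pigeonhole bound above always wins.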

Daniel
  • "The maximal (avarage) compression rate possible is 1:1." What does that actually mean? – Matthew Flaschen Jul 16 '10 at 04:10
  • It means that let's say you take all possible 100-byte strings and compress each one. The average length of your compression output is at least 100 bytes, so the average compression is 1:1 or worse. Of course, real-world data is not random so it would be best to say he's talking about optimal compression rate in the worst case. But it attempts to answer the question in the headline: maximum possible compression rate depends above all on the data. It doesn't really answer the body of the question... – jjrv Jun 17 '12 at 16:04

Basically, you need enough information to rebuild the original data. I guess the other answers are more helpful for your theoretical discussion, but just keep this in mind.

Chad