
A random string should be incompressible.

pi = "31415..."
pi.size  # => 10000
XZ.compress(pi).size  # => 4540

A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.

The string of pi only contains the bytes 48 through 57. With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 10 different characters in bytes (or 16, in the case of the hex string). Is this what's going on?

Can someone explain to me what the underlying method is, or point me to some sources?

Kappie001
  • It's up to us to guess what programming language and compression algorithm you're using? – Robby Cornelissen May 13 '15 at 19:50
  • possible duplicate of [How to compress a random string?](http://stackoverflow.com/questions/22536489/how-to-compress-a-random-string) – demonplus May 13 '15 at 19:50
  • @RobbyCornelissen I think the algorithm is XZ - it's in the code. – skrrgwasme May 13 '15 at 19:51
  • Why would a string be incompressible? It's just a series of bits, just like files. Why would any compression algorithm be able to compress file bits and not string bits? – skrrgwasme May 13 '15 at 19:52
  • Sorry, I should have made it clearer that the algorithm used is xz. But, every well known compressor can do this. I've edited my question to reflect my best guess. – Kappie001 May 13 '15 at 20:01
  • Note that π is rather special, and can in fact be compressed infinitely. E.g. `int a=10000,b,c=2800,d,e,f[2801],g;main(){for(;b-c;)f[b++]=a/5;for(;d=0,g=c*2;c-=14,printf("%.4d",e+d/a),e=d%a)for(b=c;d+=f[b]*a,f[b]=d%--g,d/=g--,--b;d*=b);}`. However xz cannot detect this. – Mark Adler May 13 '15 at 23:43
  • True. The Kolmogorov complexity of pi is very low. But so is any other pseudorandomly generated string (if you have the seed and generator.) – Kappie001 May 14 '15 at 09:40

2 Answers


It's a matter of information density. Compression is about removing redundant information.

In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8, or 256, distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit is effectively log2(10), or about 3.32, bits), but at the expense of storing some extra information that allows for decompression.
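To make the 4-bits-per-digit idea concrete, here is a minimal Ruby sketch of Binary Coded Decimal packing. The helper names `bcd_pack` and `bcd_unpack` are made up for illustration, and padding an odd-length input with the nibble 0xF is just one possible convention; this is not what xz does internally.

```ruby
# Pack a string of decimal digits two per byte (Binary Coded Decimal).
# Hypothetical helpers; an odd-length input is padded with the nibble 0xF.
def bcd_pack(digits)
  nibbles = digits.each_char.map { |c| Integer(c, 10) }
  nibbles << 0xF if nibbles.size.odd?
  nibbles.each_slice(2).map { |hi, lo| (hi << 4) | lo }.pack("C*")
end

# Reverse of bcd_pack: split each byte into two nibbles, drop the padding.
def bcd_unpack(packed)
  packed.each_byte
        .flat_map { |b| [b >> 4, b & 0x0F] }
        .reject { |n| n == 0xF }
        .join
end

digits = "314159265358979"
packed = bcd_pack(digits)
packed.bytesize               # => 8 (15 digits rounded up to 16 nibbles)
bcd_unpack(packed) == digits  # => true

# Lower bound for uniformly random decimal digits, in bits per digit:
Math.log2(10)                 # roughly 3.32
```

BCD spends 4 bits where about 3.32 would suffice, so it halves the size but still leaves some slack for a smarter encoder.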

In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
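The "exactly 50%" case, where you already know the input is hexadecimal, can be sketched in Ruby with `Array#pack` and `String#unpack1`. The method names `pack_hex` and `unpack_hex` are illustrative only, and the sketch assumes lowercase hex with an even number of characters.

```ruby
require "securerandom"

# Store two hex digits per byte. Assumes the input contains only
# lowercase hex digits and has even length.
def pack_hex(hex)
  [hex].pack("H*")
end

# Recover the original lowercase hex string.
def unpack_hex(packed)
  packed.unpack1("H*")
end

hex = SecureRandom.hex(5_000)   # 10,000 random hex characters
packed = pack_hex(hex)

hex.bytesize                    # => 10000
packed.bytesize                 # => 5000
unpack_hex(packed) == hex       # => true
```

This only works because the alphabet is fixed in advance; a general-purpose compressor has to discover that from the data and record it in its output, which is why it lands a little above 50%.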

In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
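The slight expansion is easy to observe. The sketch below uses Zlib (DEFLATE) from Ruby's standard library as a stand-in for xz, since it needs no extra gem; the exact overhead differs between formats, but the effect is the same. `deflated_size` is a hypothetical helper, and the generator is seeded only so the run is repeatable. A general-purpose compressor cannot exploit the fact that the bytes come from a PRNG.

```ruby
require "zlib"

# Size in bytes after compressing with DEFLATE at the highest level.
# Zlib is an assumption made for convenience; the question used xz.
def deflated_size(data)
  Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION).bytesize
end

random_bytes = Random.new(42).bytes(10_000)
random_bytes.bytesize         # => 10000
deflated_size(random_bytes)   # slightly more than 10000

# For contrast, highly redundant input shrinks to a few dozen bytes.
deflated_size("0" * 10_000)
```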

Explaining the details of how compression works is beyond both the scope of this answer and my expertise.

Keith Thompson
  • I figured something like this (I've used a prefix code on the first nine integers as an example in my question now.) I accepted your answer, since it explains very clearly my initial confusion. But I'm still interested in what compression programs actually DO. – Kappie001 May 13 '15 at 20:16
  • @Geert: I recommend [The Data Compression Book](http://www.amazon.com/Data-Compression-Book-Mark-Nelson/dp/1558514341), which provides details and implementations of a number of different compression algorithms. I read that book many years ago and it was really helpful. – Greg Hewgill May 13 '15 at 20:20
  • @Geert: http://stackoverflow.com/help/dont-ask: "If you can imagine an entire book that answers your question, you’re asking too much." There are a *lot* of books about data compression. – Keith Thompson May 13 '15 at 20:38
  • But even if you only use 4 bits: if you go up to multiple digits, it is apparent that only few 4-bit tuples are used, so a dictionary works to even improve the compression rate. 10 bits can store three digits already, e.g. – IceFire Jul 28 '22 at 07:29

In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.

DrGoldfire
  • I'm skeptical that this actually saves space. Any time a compressor stores repeated substrings only once, it saves the space of N-1 copies of the substring, but at the cost of additional space for metadata. In principle, if the digits of pi are mathematically random, compression significantly below log2(10) bits per digit should not be possible. – Keith Thompson May 13 '15 at 20:37
  • Setting aside the fact that you can compute them, the digits of π are random at face value. Finding matching strings will not compress the sequence. – Mark Adler May 13 '15 at 23:45