
I have a few long strings (~1,000,000 characters each). Each string contains only symbols from a defined alphabet, for example

A = {1,2,3}

Sample strings

string S1 = "1111111111 ..."; //[meta complexity] = 0
string S2 = "1111222333 ..."; //[meta complexity] = 10
string S3 = "1213323133 ..."; //[meta complexity] = 100

Q: What measures can I use to quantify the complexity of these strings? I can see that S1 is less complex than S3, but how can I determine that programmatically from .NET? Any algorithm, or a pointer to tools or literature, would be greatly appreciated.

Edit

I tried Shannon entropy, but it turned out not to be useful for me: I get the same **H** value for the sequences `AAABBBCCC` and `ABCABCABC` and `ACCCBABAB` and `BBACCABAC`.
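Shannon entropy depends only on the symbol frequencies, not on their order, so any permutation of a string gets the same **H**. A minimal sketch in Python (the same computation ports directly to .NET):

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

# All four strings are permutations of each other (three of each symbol),
# so they all have H = log2(3) ~ 1.585 bits per symbol.
for s in ("AAABBBCCC", "ABCABCABC", "ACCCBABAB", "BBACCABAC"):
    print(s, shannon_entropy(s))
```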


This is what I ended up doing
oleksii
  • Do you mean [entropy](http://en.wikipedia.org/wiki/Entropy_(information_theory))? – hammar May 21 '11 at 21:01
  • I tried that, but it turned out that it is not really useful for me. I will have the same **H** value for these sequences `AAABBBCCC` and `ABCABCABC` and `ACCCBABAB` and `BBACCABAC` – oleksii May 21 '11 at 21:04
  • In addition to **hammar**'s comment - do you mean Markov's entropy instead of Shannon's entropy? (same wikipedia link) – Premature Optimization May 21 '11 at 23:40
  • @user759588 @hammar thanks for the suggestions, but neither Shannon nor Markov entropy (rate) is a sufficiently good measure for me – oleksii May 22 '11 at 09:31
  • I think you might find an answer to your question by reading http://en.wikipedia.org/wiki/Kolmogorov_complexity – Belgi Feb 10 '12 at 19:08
  • Thanks for the answer. I considered KC as the first measure, but it is incomputable: in the general case it is not possible to calculate the complexity of an arbitrary string because of the halting problem (you never know whether your current solution is the best one, so a program searching for better solutions would never stop) – oleksii Feb 10 '12 at 19:28

1 Answer


Compressing the strings using standard techniques such as zip gives a good indication of their complexity.

Good compression rate ≈ lower complexity
Bad compression rate ≈ higher complexity
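A sketch of this idea, using Python's zlib as the compressor (`compression_ratio` is a hypothetical helper name; in .NET, System.IO.Compression's DeflateStream plays the same role):

```python
import random
import zlib

def compression_ratio(s):
    """Compressed size / original size; a smaller ratio ~ lower complexity."""
    data = s.encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)

s1 = "1" * 1_000_000                     # like S1: minimal complexity
random.seed(0)
s3 = "".join(random.choice("123") for _ in range(1_000_000))  # like S3

print(compression_ratio(s1))  # tiny: the string is trivially compressible
print(compression_ratio(s3))  # near the entropy limit log2(3)/8 ~ 0.2
```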

aioobe
  • @user759588, sure it is. Step 1: Zip the string, Step 2: Return zipped size divided by original size. – aioobe May 22 '11 at 07:03
  • @aioobe, really? Your step 1 is more like a transpacific voyage, don't you think? (the metaphor covers both distance and expense) – Premature Optimization May 22 '11 at 07:29
  • You're saying it's too complicated? Then say that (and read up on what an [algorithm](http://en.wikipedia.org/wiki/Algorithm) is). – aioobe May 22 '11 at 07:32
  • @aioobe, I'm saying that it wastes substantial resources and is completely opaque. I'm saying this is too crude. – Premature Optimization May 22 '11 at 07:52
  • Take it as a starting point. Read up on compression algorithms and take the bits and pieces relevant for this problem. It should be noted that this is an intricate and tricky problem. – aioobe May 22 '11 at 07:56
  • @aioobe @user759588 I really like this approach. It is a valid algorithm and it is not crude. It has already been tested in physics, and I have actually implemented it myself, but I was wondering what other suggestions there would be. Still, your answer is 100% correct. – oleksii May 22 '11 at 09:36
  • +1 this is an algorithm, and a rather clever one! If it is too slow, try using FastLZ or something similar. Or compress it with RLE first, and if the output is small then it's low complexity. If not, zip it. If the zipped size is small it's mid complexity, and if zip can't do anything about the size, it's high complexity. – sl0815 May 22 '11 at 09:41
  • @aioobe This seems theoretically unsound. Depending on the compression algorithm, isn't the amount of compression already related to the entropy of the string? If entropy of the string isn't a valid measure of complexity, why would a first-order approximation to the entropy be good? Seems silly. – Patrick87 Feb 10 '12 at 20:57
  • @Patrick87 compression of a string is a valid approximation of Kolmogorov complexity. See "Keogh EJ, Lonardi S, Ratanamahatana C(Ann) (2004) Towards parameter-free data mining. In: KDD Conference, Seattle, WA, pp 206–215" and [On data mining, compression, and Kolmogorov complexity](http://www.springerlink.com/content/1536t57kk558606r/) – oleksii Feb 11 '12 at 11:22
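The tiered scheme sl0815 suggests above (cheap RLE first, zip only if RLE doesn't help) might be sketched as follows. `rle_length`, `rough_complexity`, and the two thresholds are hypothetical illustrations, not anything from the comment:

```python
import zlib

def rle_length(s):
    """Size of a naive run-length encoding: one (symbol, count) pair per run."""
    runs = 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)
    return 2 * runs

def rough_complexity(s, low=0.1, high=0.15):
    # Thresholds are arbitrary. Note that for a 3-symbol alphabet even a
    # random string can only be deflated to about log2(3)/8 ~ 0.2 of its
    # size, so the "high" cutoff sits below that bound.
    if rle_length(s) / len(s) < low:
        return "low"      # long runs: RLE alone shrinks it dramatically
    zipped = len(zlib.compress(s.encode("ascii"), 9)) / len(s)
    return "mid" if zipped < high else "high"
```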