
I have a program that generates a large theoretical database of strings (104 characters long), with results measured in petabytes. I don't have that much computing power, so I would like to filter the low-complexity strings out of the database.

My grammar is a modified form of the English alphabet with no numerical characters. I have read about Kolmogorov complexity and how it is theoretically impossible to calculate, but I just need something basic in C# using compression.

Using these two links

I came up with this:

MemoryStream ms = new MemoryStream();
GZipStream gzip2 = new GZipStream(ms, CompressionMode.Compress, true);

byte[] raw = Encoding.UTF8.GetBytes(element);
gzip2.Write(raw, 0, raw.Length);
gzip2.Close();

byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
int startsize = raw.Length;
int finishsize = raw2.Length;
double percent = Convert.ToDouble(finishsize) / Convert.ToDouble(startsize);
if (percent > .75)
{
    // output
}

My first element is:

HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH

and it compresses to a finishsize of 13, but this other character set

mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf

also evaluates to 13. There is a bug, but I don't know how to fix it.

  • Does the compressed buffer uncompress to the original string? If so, then it probably works for some definition of "works". – millimoose Aug 24 '12 at 19:01
    You could also try the other compression algorithms from [#ziplib](http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx) - bzip2 specifically should get a better compression ratio than gzip, and thus be a more accurate estimate of the information content. (LZMA should be better yet from the general purpose algorithms, and a C# implementation is available in the [7zip SDK](http://www.7-zip.org/sdk.html).) That said, the better the compression (and the more accurate the estimate), the more resources it will take. – millimoose Aug 24 '12 at 19:05
  • Also, why are you converting the zipped stream to a string then to an array again? It seems like you should be comparing `raw.Length` with `zipped.Length` instead. – millimoose Aug 24 '12 at 19:07
  • Last, but not least, the gzip stream might contain a header of some sort. You should look into the relevant specifications to see whether you need to strip that off if you want just the size of the compressed payload. – millimoose Aug 24 '12 at 19:08
  • @millimoose thanks, I thought something didn't look right; let me check that –  Aug 24 '12 at 19:08
  • @millimoose if you want to post your correction of `raw.Length` and `zipped.Length` I'll accept that as an answer... the code did have that error as best I can tell, and there is some header, but I don't know how big it is; it's all good either way –  Aug 24 '12 at 19:25
  • Ah. Yeah, now that I think about it, the reason everything evaluates to 13 would be that bug: `Convert.ToString(zipped)` returns some useless debugging output. http://ideone.com tells me it's `System.Byte[]` – millimoose Aug 24 '12 at 19:28
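The pitfall millimoose points out in the last comment can be reproduced in a few lines; a minimal sketch (the array contents are just the gzip magic bytes, chosen for illustration):

```csharp
using System;

class ToStringPitfall
{
    static void Main()
    {
        byte[] zipped = { 0x1f, 0x8b, 0x08 };

        // Convert.ToString(object) just calls ToString() on the array,
        // which returns the type name rather than the contents:
        Console.WriteLine(Convert.ToString(zipped));       // "System.Byte[]" - 13 characters

        // To store the bytes as text, use Base64 instead:
        Console.WriteLine(Convert.ToBase64String(zipped)); // "H4sI"
    }
}
```

Note that `"System.Byte[]"` is exactly 13 characters long, which is why every input in the question "compresses" to a finishsize of 13.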

3 Answers


Your bug is in the following part, where you convert the array into a string:

byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);

Calling `Convert.ToString()` on an array returns the result of the array's default `ToString()` method, which is just its type name: the 13-character string `System.Byte[]`. (See the following example on ideone.)

You should compare the lengths of the uncompressed and compressed byte array directly:

int startsize = raw.Length;
int finishsize = zipped.Length;
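Putting that fix back into the question's code, a minimal self-contained sketch might look like this (the `0.75` threshold comes from the question; the helper name is mine):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class ComplexityFilter
{
    // Ratio of compressed size to original size: near 1 means the string
    // is hard to compress (high complexity), near 0 means repetitive.
    public static double CompressionRatio(string element)
    {
        byte[] raw = Encoding.UTF8.GetBytes(element);

        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            byte[] zipped = ms.ToArray();

            // Compare the byte arrays directly - no string round trip
            return (double)zipped.Length / raw.Length;
        }
    }

    static void Main()
    {
        string element = "HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH";
        double percent = CompressionRatio(element);
        if (percent > 0.75)
        {
            Console.WriteLine("high complexity: {0:F2}", percent);
        }
    }
}
```

For the repetitive first element above, the ratio comes out well below 0.75 (the gzip header and trailer add a fixed overhead of roughly 18 bytes, so very short strings will look less compressible than they are).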
millimoose

Here is some code that I used

/// <summary>
/// Defines an interface to calculate the relative
/// complexity of an input string
/// </summary>
public interface IStringComplexity
{
    double GetCompressionRatio(string input);
    double GetRelevantComplexity(double min, double max, double current);
}

And the class that implements it

public class GZipStringComplexity : IStringComplexity
{
    public double GetCompressionRatio(string input)
    {
        if (string.IsNullOrEmpty(input))
            throw new ArgumentNullException("input");

        byte[] inputBytes = Encoding.UTF8.GetBytes(input);
        byte[] compressed;

        using (MemoryStream outStream = new MemoryStream())
        {
            using (var zipStream = new GZipStream(
                outStream, CompressionMode.Compress))
            {
                using (var memoryStream = new MemoryStream(inputBytes))
                {
                    memoryStream.CopyTo(zipStream);
                }
            }
            compressed = outStream.ToArray();
        }

        return (double)inputBytes.Length / compressed.Length;
    }

    /// <summary>
    /// Returns relevant complexity of a string on a scale [0..1], 
    /// where <value>0</value> has very low complexity
    /// and <value>1</value> has maximum complexity
    /// </summary>
    /// <param name="min">minimum compression ratio observed</param>
    /// <param name="max">maximum compression ratio observed</param>
    /// <param name="current">the value of the compression ratio
    /// for which complexity is being calculated</param>
    /// <returns>A relative complexity of a string</returns>
    public double GetRelevantComplexity(double min, double max, double current)
    {
        // Min-max normalisation: current == max maps to 0 (least complex),
        // current == min maps to 1 (most complex)
        return 1 - (current - min) / (max - min);
    }
}

Here is how you can use it

class Program
{
    static void Main(string[] args)
    {
        IStringComplexity c = new GZipStringComplexity();

        string input1 = "HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH";
        string input2 = "mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf";
        string inputMax = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";

        double ratio1 = c.GetCompressionRatio(input1); //2.9714285714285715
        double ratio2 = c.GetCompressionRatio(input2); //1.3138686131386861
        double ratioMax = c.GetCompressionRatio(inputMax); //7.5

        double complexity1 = c.GetRelevantComplexity(1, ratioMax, ratio1); // ~ 0.70
        double complexity2 = c.GetRelevantComplexity(1, ratioMax, ratio2); // ~ 0.95
    }
}

Some additional info that I found helpful.

You can try using LZMA, LZMA2 or PPMd from the 7-Zip library. They are relatively easy to set up, and provided you code against an interface you can plug in several compression algorithms. I found that these algorithms achieve much better compression than GZip, but once you put the compression ratios on a normalised scale the choice doesn't matter much.

If you need a normalised value, for example from 0 to 1, you need to calculate the compression ratio for all the sequences first, because you can't know in advance what the maximum possible compression ratio is.
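That two-pass idea can be sketched in a self-contained way as follows (using plain min-max normalisation; the sample sequences are illustrative, and the helper duplicates the ratio calculation inline so the sketch compiles on its own):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

class NormalisationSketch
{
    // Same ratio as GetCompressionRatio above: uncompressed / compressed length
    public static double Ratio(string input)
    {
        byte[] raw = Encoding.UTF8.GetBytes(input);
        using (var outStream = new MemoryStream())
        {
            using (var zip = new GZipStream(outStream, CompressionMode.Compress))
            {
                zip.Write(raw, 0, raw.Length);
            }
            return (double)raw.Length / outStream.ToArray().Length;
        }
    }

    static void Main()
    {
        var sequences = new List<string>
        {
            new string('a', 180),                       // trivially repetitive
            "mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsm", // protein-like, from the question
        };

        // First pass: compute every ratio so min and max are known
        var ratios = sequences.Select(Ratio).ToList();
        double min = ratios.Min(), max = ratios.Max();

        // Second pass: normalise onto [0..1]; with only two inputs the
        // ends of the scale are pinned to exactly 0 and 1
        for (int i = 0; i < sequences.Count; i++)
        {
            double complexity = 1 - (ratios[i] - min) / (max - min);
            Console.WriteLine("{0:F2}  {1}", complexity, sequences[i]);
        }
    }
}
```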

oleksii

Sure, that will work. As long as you're just comparing sizes, it doesn't really matter which compression algorithm you use. Your main concern is keeping an eye on the amount of processing power you spend compressing the strings.
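Since only relative sizes matter, one small refinement worth mentioning (an assumption on my part, not from the answer): `DeflateStream` uses the same algorithm as `GZipStream` but without gzip's roughly 18 bytes of header and trailer, so it measures short strings with less fixed overhead. A sketch:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class DeflateSize
{
    public static int CompressedSize(string s)
    {
        byte[] raw = Encoding.UTF8.GetBytes(s);
        using (var ms = new MemoryStream())
        {
            // DeflateStream omits the gzip header/trailer, leaving
            // just the compressed payload in the stream
            using (var deflate = new DeflateStream(ms, CompressionMode.Compress))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return ms.ToArray().Length;
        }
    }

    static void Main()
    {
        Console.WriteLine(CompressedSize("HHHFHHFFHHFHHFFHHFHHHFHAAAA"));
        Console.WriteLine(CompressedSize(new string('a', 100)));
    }
}
```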

Anthony Mills
  • Ah. Well your approach was a decent one (check compressed lengths) - I guess I didn't notice you had a bug in there. :) – Anthony Mills Aug 24 '12 at 23:31