I have a program that generates a database of theoretical strings (each 104 characters long), and its output is measured in petabytes. I don't have that much computing power, so I would like to filter the low-complexity strings out of the database.
My grammar is a modified form of the English alphabet with no numerical characters. I have read about Kolmogorov complexity and how it is impossible to compute in general, but I just need something basic in C# using compression: a repetitive string should compress to a small fraction of its size, while a near-random one should barely shrink, so the compression ratio can stand in as a rough complexity score.
Using these two links, I came up with this:
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// element is one candidate string from the generator
MemoryStream ms = new MemoryStream();
GZipStream gzip2 = new GZipStream(ms, CompressionMode.Compress, true);
byte[] raw = Encoding.UTF8.GetBytes(element);
gzip2.Write(raw, 0, raw.Length);
gzip2.Close(); // flush the gzip footer into ms

byte[] zipped = ms.ToArray();                  // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64

byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
int startsize = raw.Length;
int finishsize = raw2.Length;
double percent = Convert.ToDouble(finishsize) / Convert.ToDouble(startsize);
if (percent > .75)
{
    // output: keep the element, it barely compressed
}
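To rule out the compression step itself, I also round-trip one element and check that decompression gives the original back (my own sanity check, not part of the filter):

ms.Position = 0; // rewind the stream holding the gzipped bytes
using (GZipStream gunzip = new GZipStream(ms, CompressionMode.Decompress, true))
using (MemoryStream decompressed = new MemoryStream())
{
    gunzip.CopyTo(decompressed);
    // expect True: gzip is lossless, so the round trip must match
    Console.WriteLine(Encoding.UTF8.GetString(decompressed.ToArray()) == element);
}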
My first element is:
HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH
and it compresses to a finishsize of 13, but this other character set
mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf
also evaluates to 13, so the measurement must be broken somewhere.
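The best explanation I can come up with: Convert.ToString(zipped) never serializes the bytes. On a byte[] it falls back to Object.ToString() and returns the literal type name "System.Byte[]", which happens to be exactly 13 characters, so finishsize is 13 for every element. If that is right, the fix is to measure zipped.Length directly. Here is a minimal sketch of the whole check as a method (CompressionRatio is my own name for it; same usings as above):

static double CompressionRatio(string element)
{
    byte[] raw = Encoding.UTF8.GetBytes(element);
    using (MemoryStream ms = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            gzip.Write(raw, 0, raw.Length);
        } // Dispose flushes the gzip footer

        byte[] zipped = ms.ToArray();

        // measure the compressed bytes directly; if I need a string for
        // storage, Convert.ToBase64String(zipped) is lossless, but it is
        // about 4/3 the size, so measure before encoding
        return (double)zipped.Length / raw.Length;
    }
}

// keep only elements that resist compression
if (CompressionRatio(element) > 0.75)
{
    // output
}

With zipped.Length as the measure, the two example elements should finally produce different ratios.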