I need to understand if a string is sufficiently random or not. Can anyone point me in the right direction?

Background

I need to emulate the behaviour of a process that copies itself to a temp location, renames itself to a random name, and executes itself. My ultimate goal is to detect such activity. As part of this work I need to test a process name, which is a string, for randomness. I understand that Kolmogorov complexity deals with this, but it is uncomputable. What would be quick alternatives: some entropy measure, or the Lempel-Ziv compression ratio?

What I am looking for

string s1 = "test process name";
string s2 = "hgoi4dFh3e905jv";

double sensitivity = 0.5; // user-defined variable, a subjective threshold of randomness
bool b1 = SeemsRandom(s1, sensitivity);  // false
bool b2 = SeemsRandom(s2, sensitivity);  // true

bool SeemsRandom(string input, double sensitivity)
{
    ...
}
oleksii
  • What comes to mind would be to have a dictionary of words and check how many words in the dictionary are a substring of the name, but I think that time complexity wise you could do better... – npinti Mar 25 '14 at 13:47
  • You're basically asking us to define "sufficiently random" and I think that is a bit too broad for SO. See also [How can I determine the statistical randomness of a binary string?](http://stackoverflow.com/questions/3097949/how-can-i-determine-the-statistical-randomness-of-a-binary-string) for a few pointers. – CodeCaster Mar 25 '14 at 13:47
  • I would try to look at how vowels, consonants and numbers are distributed relative to spaces. – LunicLynx Mar 25 '14 at 13:48
  • That second string up there is my facebook password. – 500 - Internal Server Error Mar 25 '14 at 13:50
  • Calculate the Shannon entropy. Lots of hits when you google that. – Hans Passant Mar 25 '14 at 13:52
  • Comparing substrings against a dictionary is the one method which comes to mind. Another, more language-independent way would be to devise an algorithm for determining if something is "word-like", and then testing substrings with this algorithm. – hyde Mar 25 '14 at 13:53
  • "test process name" is "tesztelési folyamat megnevezése" in Hungarian (via Google Translate). Once you get the algorithm, please test it on http://en.wikipedia.org/wiki/Voynich_manuscript as well : ) – Konrad Morawski Mar 25 '14 at 13:53
  • Do you mean that the randomness of a string means that it is not representable in any language? From a strict binary point of view, strings such as "This is my password" and "Yjod od åsddeptf" (QWERTY shifted one key to the right) are much the same in randomness, but only one of them is understandable by English speakers. – Janne Matikainen Mar 25 '14 at 13:58
  • My recommendation is to compute metrics on the text. You might look at vowels versus consonants. You can also look at the number of repeated letters or the frequency of certain common digraphs ("TH", "SH", etc). I think combining these will help get you closer to calculating the "randomness" of a letter sequence. – drew_w Mar 25 '14 at 14:00
  • Vowels versus entire text length could be of some use. It's 0.3246 for your question (excluding the code excerpt), 0.2941 for "test process name" but only 0.2 for "hgoi4dFh3e905jv". Vowels are more prevalent in natural speech than their proportion in the alphabet would indicate. I think it applies to most human languages (except for written Arabic maybe? I'm not a linguist). See http://en.wikipedia.org/wiki/Letter_frequency - 5 most popular letters in English are e, t, a, o, i. But you will never know for sure. – Konrad Morawski Mar 25 '14 at 14:01
  • As an aside, to generate those random file names .NET has a built in function [`Path.GetRandomFileName()`](http://msdn.microsoft.com/en-us/library/system.io.path.getrandomfilename(v=vs.110).aspx), that will generate a cryptographically strong random file name for you. – Scott Chamberlain Mar 25 '14 at 14:08
  • I don't think it would really work for short strings. See http://www.shannonentropy.netmark.pl/calculate - metric entropy for `hgoi4dFh3e905jv` is 0.25157 and for `Konrad Morawski` it's 0.23379 – Konrad Morawski Mar 25 '14 at 14:20
  • So, Konrad, is that your real name? ;) – 500 - Internal Server Error Mar 25 '14 at 14:24
  • Compression would be the traditional approach, since the point is to minimize redundancy. The larger the possible reduction in size, the less entropy. – John C Mar 25 '14 at 14:32
  • By birth it's yjco4fs0.evj but mum called me `Konrad'); DROP TABLE Students;` for short. Obligatory http://xkcd.com/327/ :) – Konrad Morawski Mar 25 '14 at 14:33
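
Several of the suggestions in the comments above reduce to simple per-string metrics. The following is a minimal C# sketch of three of them: Shannon entropy over character frequencies, the vowel ratio, and a compression ratio. The class and method names (`NameMetrics`, `ShannonEntropy`, `MetricEntropy`, `VowelRatio`, `CompressionRatio`) are hypothetical. "Metric entropy" is assumed here to mean Shannon entropy divided by string length, and the use of Deflate over UTF-8 bytes as the compressor is also an assumption. None of this decides what counts as "sufficiently random"; it only produces numbers to compare against a threshold.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

public static class NameMetrics
{
    // Shannon entropy in bits per character, from the empirical
    // character frequencies of the string itself.
    public static double ShannonEntropy(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;
        return s.GroupBy(c => c)
                .Select(g => (double)g.Count() / s.Length)
                .Sum(p => -p * Math.Log(p, 2));
    }

    // Assumption: "metric entropy" = Shannon entropy / string length.
    public static double MetricEntropy(string s) =>
        string.IsNullOrEmpty(s) ? 0.0 : ShannonEntropy(s) / s.Length;

    // Fraction of characters that are a, e, i, o or u (either case),
    // measured against the entire string length.
    public static double VowelRatio(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;
        return (double)s.Count(c => "aeiouAEIOU".IndexOf(c) >= 0) / s.Length;
    }

    // Compressed size divided by original size, using Deflate on the
    // UTF-8 bytes. Lower values mean more redundancy was found.
    public static double CompressionRatio(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;
        byte[] raw = Encoding.UTF8.GetBytes(s);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen = true so the buffer can be measured after the
            // compressor has been disposed (which flushes its output).
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)buffer.Length / raw.Length;
        }
    }
}
```

With these definitions, `VowelRatio("test process name")` is about 0.2941 and `VowelRatio("hgoi4dFh3e905jv")` is 0.2, while `MetricEntropy` gives about 0.2516 for `hgoi4dFh3e905jv` and about 0.2338 for `Konrad Morawski`, in line with the figures quoted in the comments. For names this short the compressor's fixed overhead tends to dominate, so `CompressionRatio` can exceed 1 and is unlikely to separate the two example strings well.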

1 Answer

You may want to try converting the string to a binary sequence and applying the Wald-Wolfowitz runs test, which should be less complicated than the Kolmogorov–Smirnov test.

http://en.wikipedia.org/wiki/Wald%E2%80%93Wolfowitz_runs_test
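
As a rough illustration, the runs test could be sketched in C# as below. The binarization is an assumption, since the choice is left open here: the string is encoded as UTF-8 and its bits are read most significant bit first. The names `RunsTest`, `ToBits` and `ZScore` are hypothetical. The statistic uses the usual normal approximation, with expected number of runs 2·n1·n0/n + 1 and variance (μ − 1)(μ − 2)/(n − 1), where n1 and n0 are the counts of ones and zeros and n = n1 + n0.

```csharp
using System;
using System.Text;

public static class RunsTest
{
    // Assumption: the binary sequence is the bits of the UTF-8 bytes,
    // most significant bit first. Other binarizations are possible.
    public static bool[] ToBits(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        var bits = new bool[bytes.Length * 8];
        for (int i = 0; i < bytes.Length; i++)
            for (int b = 0; b < 8; b++)
                bits[i * 8 + b] = (bytes[i] & (0x80 >> b)) != 0;
        return bits;
    }

    // Z statistic of the Wald-Wolfowitz runs test (normal approximation).
    // Returns NaN when the statistic is undefined, e.g. all bits equal.
    public static double ZScore(bool[] bits)
    {
        int n = bits.Length;
        int n1 = 0;
        foreach (bool bit in bits)
            if (bit) n1++;
        int n0 = n - n1;
        if (n1 == 0 || n0 == 0) return double.NaN;

        // A run is a maximal block of identical consecutive bits.
        int runs = 1;
        for (int i = 1; i < n; i++)
            if (bits[i] != bits[i - 1]) runs++;

        double mean = 2.0 * n1 * n0 / n + 1.0;
        double variance = (mean - 1.0) * (mean - 2.0) / (n - 1.0);
        if (variance <= 0.0) return double.NaN;

        return (runs - mean) / Math.Sqrt(variance);
    }
}
```

A call such as `RunsTest.ZScore(RunsTest.ToBits("hgoi4dFh3e905jv"))` yields a z value; under the normal approximation, an absolute value above roughly 1.96 would reject the hypothesis of randomness at the 5% level. Note that ASCII characters always have a zero top bit, so raw text bits are structured regardless of how the name was generated, and a different binarization may suit process names better.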

ommehta