
Recently I decided to investigate how random a globally unique identifier generated with the Guid.NewGuid method really is (which is also the scope of this question). I read up on pseudorandom numbers and pseudorandomness, and I was dazzled to find out that there are even random numbers generated from radioactive decay. Anyway, I'll let you discover more about these interesting topics yourselves.

Before getting to my question, another important thing to know about GUIDs is:

V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2F1E4FC0-81FD-11DA-9156-00036A0F876A}.

V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38A52BE4-9352-453E-AF97-5C3B448652F0}.

To put it in a sentence, a Guid will always have the digit 4 (or 1, but out of our scope) as one of its components.
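
The version digit described above can be read straight from the string form. This is only a minimal sketch: `GuidVersionCheck` and `VersionDigit` are hypothetical names made up for illustration, and it assumes the default "D" format, where the version digit sits at index 14.

```csharp
using System;

static class GuidVersionCheck
{
    // Hypothetical helper: returns the first character of the third group.
    // "D" format is xxxxxxxx-xxxx-Vxxx-yxxx-xxxxxxxxxxxx, so V is at index 14.
    public static char VersionDigit(Guid g)
    {
        return g.ToString("D")[14];
    }

    static void Main()
    {
        // The two example GUIDs quoted above.
        Console.WriteLine(VersionDigit(Guid.Parse("2F1E4FC0-81FD-11DA-9156-00036A0F876A")));
        Console.WriteLine(VersionDigit(Guid.Parse("38A52BE4-9352-453E-AF97-5C3B448652F0")));

        // A freshly generated GUID; this should report version 4 if
        // Guid.NewGuid produces random (V4) GUIDs as described above.
        Console.WriteLine(VersionDigit(Guid.NewGuid()));
    }
}
```

For the two quoted examples this reports 1 and 4 respectively.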

For my GUID randomness tests I decided to count the decimal digits (0-9) in increasingly large collections of GUIDs and compare their share of all hexadecimal characters with the expected probability of digit occurrence, expectedOccurrence. Or at least I hope I did (please excuse any mistakes in the statistical formulas; they are only my best guesses). I used a small C# console application, which is listed below.

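using System;
using System.Linq;
using System.Text;

class Program
{
    // Decimal digits only; the hexadecimal letters a-f are deliberately left out.
    static char[] digitsChar = "0123456789".ToCharArray();

    // My estimate, in percent: 31 positions with a 10/16 chance of holding
    // a decimal digit, plus one position that is always "4".
    static decimal expectedOccurrence = (10M * 100 / 16) * 31 / 32 + (100M / 32);

    static void Main(string[] args)
    {
        // Sample sizes: 20000, 40000, ..., 200000 GUIDs.
        for (int i = 1; i <= 10; i++)
        {
            CalculateOccurrence(i);
        }
    }

    private static void CalculateOccurrence(int counter)
    {
        decimal sum = 0;
        var sBuilder = new StringBuilder();
        int localCounter = counter * 20000;
        for (int i = 0; i < localCounter; i++)
        {
            // Appends the default string form (with hyphens).
            sBuilder.Append(Guid.NewGuid());
        }

        // Count the decimal digits; hyphens and a-f are not counted.
        sum = (sBuilder.ToString()).ToCharArray()
                  .Count(j => digitsChar.Contains(j));

        // Percentage of decimal digits among the 32 hex characters per GUID.
        decimal actualLocalOccurrence = sum * 100 / (localCounter * 32);

        Console.WriteLine(String.Format("{0}\t{1}",
            expectedOccurrence,
            Math.Round(actualLocalOccurrence,3)
            ));
    }
}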

The output for the above program is:

63.671875       63.273
63.671875       63.300
63.671875       63.331
63.671875       63.242
63.671875       63.292
63.671875       63.269
63.671875       63.292
63.671875       63.266
63.671875       63.254
63.671875       63.279

So, even though the theoretical occurrence is expected to be 63.671875%, the actual values stay between roughly 63.2% and 63.3%.

How can this difference be explained? Is there any error in my formulas? Is there any other "obscure" rule in the Guid algorithm?

marc_s
Alex Filipovici
  • +1 It's way beyond me, but I know a question like this takes a lot of time to build, and it is a very good question. Good luck. – FrostyFire Jan 30 '13 at 02:35
  • So you're generating 20000 GUIDs, concatenating them in hexadecimal format, and counting the percentage of decimal digits in the string. Shouldn't the expected result be 10/16=62.5%, ignoring the fact that not all bits of a GUID are random? – dtb Jan 30 '13 at 02:41
  • first of all, why? second of all, where do you get random from GUID? there is no R in GUID. third of all, http://blogs.msdn.com/b/oldnewthing/archive/2008/06/27/8659071.aspx. there isn't enough space on here to go into the historical reasons for guid, uuid, etc. let's just say that randomness was never a motivation. – thang Jan 30 '13 at 02:48
  • @dtb, that was also my original estimate. But then I realized that the probability is influenced by the fact that one of the 32 hexadecimal characters is always 4. The formula for `expectedOccurrence` should explain my estimate. – Alex Filipovici Jan 30 '13 at 02:50
  • GUIDs generated from random numbers contain 6 fixed bits saying they are random and 122 random bits. But one hexadecimal digit represents only 4 bits, so it's not just the 4 that skews the result. I don't understand your formula at all. – dtb Jan 30 '13 at 02:55
  • How do you know enough to have an expectation of the probability? The specification says nothing about the distribution of the random variable, so implementation is up to the imagination. I guess you can dig into UuidCreate to get the exact details. Even the quote you have says "pseudo-random number", which you assumed to mean uniform pseudo-random number. – thang Jan 30 '13 at 05:22
  • You might want to read my lengthy series of articles on the uses and abuses of GUIDs; it begins here: http://blogs.msdn.com/b/ericlippert/archive/2012/04/24/guid-guide-part-one.aspx – Eric Lippert Jan 30 '13 at 05:44

2 Answers


In the version 4 GUID, the first character in the third group is 4. The first character in the fourth group is one of 8, 9, a, or b. The specification does not say anything about how that first character in the fourth group is generated. That could be throwing off your results.

If you want to investigate further, you need to keep track of how often each hexadecimal digit appears in each position. I suspect that will reveal the difference, and help you to determine whether your theoretical estimate is off, or the pseudo-random algorithm is slightly biased.
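
A per-position tally along these lines could look like the sketch below. The names `GuidPositionHistogram` and `Count` are hypothetical, the sample size of 20000 is simply borrowed from the question, and the "N" format (32 hex characters, no hyphens) is used so that string index and hex position coincide.

```csharp
using System;

static class GuidPositionHistogram
{
    // Hypothetical helper: counts[position, value] is how often the hex
    // value (0-15) appeared at that position (0-31) across the sample.
    public static int[,] Count(int sampleSize)
    {
        var counts = new int[32, 16];
        for (int i = 0; i < sampleSize; i++)
        {
            // "N" format: 32 hexadecimal characters without hyphens.
            string hex = Guid.NewGuid().ToString("N");
            for (int pos = 0; pos < 32; pos++)
            {
                int value = Convert.ToInt32(hex[pos].ToString(), 16);
                counts[pos, value]++;
            }
        }
        return counts;
    }

    static void Main()
    {
        const int sampleSize = 20000;
        int[,] counts = Count(sampleSize);

        // For each position, print the percentage of decimal digits (0-9).
        for (int pos = 0; pos < 32; pos++)
        {
            int decimalDigits = 0;
            for (int v = 0; v < 10; v++)
            {
                decimalDigits += counts[pos, v];
            }
            Console.WriteLine("{0}\t{1:F2}", pos, 100.0 * decimalDigits / sampleSize);
        }
    }
}
```

In this layout position 12 is the first character of the third group and position 16 is the first character of the fourth group, so those are the two rows to compare against the others.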

Jim Mischel
  • Of course, that explains it! The `y` hexadecimal in `xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx` has a fixed `50%` probability of being a digit. Probably I should have read the specification for [Version 4 (random) of UUID](http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Version_4_.28random.29) first. Thanks! – Alex Filipovici Jan 30 '13 at 08:40

Jim got it (I just found this question, whose answer gave the same insight into v4 GUID generation).

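So, altering the expected-value formula with this new knowledge (30 fully random positions, one position that is always 4, and one position that is a decimal digit half the time), you get ((10/16)*30+1+0.5)/32, or (10M * 100 / 16) * 30 / 32 + (150M / 32) in the question's notation, which is about 63.28%, pretty close to the experimental data you were getting.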
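
As a quick arithmetic check, the two estimates can be computed side by side. This is just a sketch: `GuidDigitExpectation` and its method names are hypothetical, and the corrected figure assumes the first character of the fourth group is equally likely to be 8, 9, a, or b, which the specification does not guarantee.

```csharp
using System;

static class GuidDigitExpectation
{
    // The question's estimate, in percent: 31 random positions plus one fixed "4".
    public static decimal OriginalPercent()
    {
        return (10M * 100 / 16) * 31 / 32 + (100M / 32);
    }

    // The corrected estimate, in percent: 30 random positions, one fixed "4",
    // and one position that is a decimal digit (8 or 9) half the time.
    // Assumption: 8, 9, a, and b are equally likely at that position.
    public static decimal CorrectedPercent()
    {
        decimal uniform = 10M / 16;   // chance that a random hex character is 0-9
        decimal version = 1M;         // the fixed "4" is always a decimal digit
        decimal variant = 0.5M;       // 8 or 9 out of 8, 9, a, b
        return (30 * uniform + version + variant) / 32 * 100;
    }

    static void Main()
    {
        Console.WriteLine("original:  {0}", OriginalPercent());
        Console.WriteLine("corrected: {0}", CorrectedPercent());
    }
}
```

The original formula evaluates to 63.671875 and the corrected one to 63.28125, which falls within the range of the measured values.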

Cemafor