Recently I decided to investigate the degree of randomness of a globally unique identifier generated with the Guid.NewGuid
method (which is also the scope of this question). I documented myself about pseudorandom numbers, pseudorandomness and I was dazzled to find out that there are even random numbers generated by radioactive decay. Anyway, I'll let you discover by yourselves more details about such interesting lectures.
To continue to my question, another important thing to know about GUID's is:
V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2F1E4FC0-81FD-11DA-9156-00036A0F876A}.
V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38A52BE4-9352-453E-AF97-5C3B448652F0}.
To put it in a sentence, a Guid
will always have the digit 4 (or 1, but out of our scope) as one of its components.
For my GUID randomness tests I decided to count the number of digits inside some increasingly large collection of GUIDs and compare it with the statistical probability of digit occurrence, expectedOccurrence
. Or at least I hope I did (please excuse any statistical formula mistakes, I only tried my best guesses to calculate the values). I used a small C#
console application which is listed below.
class Program
{
static char[] digitsChar = "0123456789".ToCharArray();
static decimal expectedOccurrence = (10M * 100 / 16) * 31 / 32 + (100M / 32);
static void Main(string[] args)
{
for (int i = 1; i <= 10; i++)
{
CalculateOccurrence(i);
}
}
private static void CalculateOccurrence(int counter)
{
decimal sum = 0;
var sBuilder = new StringBuilder();
int localCounter = counter * 20000;
for (int i = 0; i < localCounter; i++)
{
sBuilder.Append(Guid.NewGuid());
}
sum = (sBuilder.ToString()).ToCharArray()
.Count(j => digitsChar.Contains(j));
decimal actualLocalOccurrence = sum * 100 / (localCounter * 32);
Console.WriteLine(String.Format("{0}\t{1}",
expectedOccurrence,
Math.Round(actualLocalOccurrence,3)
));
}
}
The output for the above program is:
63.671875 63.273
63.671875 63.300
63.671875 63.331
63.671875 63.242
63.671875 63.292
63.671875 63.269
63.671875 63.292
63.671875 63.266
63.671875 63.254
63.671875 63.279
So, even if the theoretical occurrence is expected to be 63.671875%
, the actual values are somewhere around ~63.2%
.
How can this difference be explained? Is there any error in my formulas? Is there any other "obscure" rule in the Guid
algorithm?