1

I have a program that dispatches messages to separate processes. I need to balance the load, but not in very precise way, almost the same number is ok. Since every message has an uuid field, I want to do it by uuid value. After I tested the uuid randomness I found it to not be as random as I expexted. I have the last one and the first one about 80% difference. This is unacceptable, so I want to know if there is an algorithm that can make it more random.

Here is my test code.

import uuid
from collections import Counter

COUNT = 3000

def b(length):
    holder = []
    for i in xrange(COUNT):
        holder.append(str(uuid.uuid4())[:length])
    return Counter(holder)

def num(part_count):
    sep = 0xffffffffffffffffffffffffffffffff / part_count
    parts = []
    for i in xrange(COUNT):
#        str_hex = str(uuid.uuid4())[:4]
        num = int(uuid.uuid4().hex,16)
        divide = num/sep
        if divide == part_count:
            divide = part_count - 1
        parts.append(divide)
    return Counter(parts)

if __name__ == "__main__":
    print num(200) 

and I get the output like this:

Counter({127L: 29, 198L: 26, 55L: 25, 178L: 24, 184L: 24, 56L: 23, 132L: 23, 143L: 23, 148L: 23, 195L: 23, 16L: 21, 30L: 21, 44L: 21, 53L: 21, 97L: 21, 158L: 21, 185L: 21, 13L: 20, 146L: 20, 149L: 20, 196L: 20, 2L: 19, 11L: 19, 15L: 19, 19L: 19, 46L: 19, 58L: 19, 64L: 19, 68L: 19, 70L: 19, 89L: 19, 112L: 19, 118L: 19, 128L: 19, 144L: 19, 156L: 19, 192L: 19, 27L: 18, 41L: 18, 42L: 18, 51L: 18, 54L: 18, 85L: 18, 87L: 18, 88L: 18, 93L: 18, 94L: 18, 104L: 18, 106L: 18, 115L: 18, 4L: 17, 22L: 17, 45L: 17, 59L: 17, 79L: 17, 81L: 17, 105L: 17, 125L: 17, 138L: 17, 150L: 17, 159L: 17, 167L: 17, 194L: 17, 3L: 16, 18L: 16, 28L: 16, 31L: 16, 33L: 16, 62L: 16, 65L: 16, 83L: 16, 111L: 16, 123L: 16, 126L: 16, 133L: 16, 145L: 16, 147L: 16, 163L: 16, 166L: 16, 183L: 16, 188L: 16, 190L: 16, 5L: 15, 6L: 15, 9L: 15, 23L: 15, 26L: 15, 34L: 15, 35L: 15, 38L: 15, 69L: 15, 73L: 15, 74L: 15, 77L: 15, 82L: 15, 86L: 15, 107L: 15, 108L: 15, 109L: 15, 110L: 15, 114L: 15, 136L: 15, 141L: 15, 142L: 15, 153L: 15, 160L: 15, 169L: 15, 176L: 15, 180L: 15, 186L: 15, 0L: 14, 1L: 14, 36L: 14, 39L: 14, 43L: 14, 60L: 14, 71L: 14, 72L: 14, 76L: 14, 92L: 14, 113L: 14, 131L: 14, 135L: 14, 157L: 14, 171L: 14, 172L: 14, 181L: 14, 189L: 14, 7L: 13, 17L: 13, 20L: 13, 24L: 13, 25L: 13, 32L: 13, 47L: 13, 49L: 13, 101L: 13, 102L: 13, 117L: 13, 121L: 13, 122L: 13, 124L: 13, 130L: 13, 151L: 13, 152L: 13, 165L: 13, 179L: 13, 14L: 12, 21L: 12, 29L: 12, 50L: 12, 63L: 12, 67L: 12, 80L: 12, 84L: 12, 90L: 12, 91L: 12, 96L: 12, 120L: 12, 129L: 12, 139L: 12, 140L: 12, 182L: 12, 193L: 12, 197L: 12, 52L: 11, 75L: 11, 78L: 11, 103L: 11, 116L: 11, 119L: 11, 134L: 11, 137L: 11, 161L: 11, 173L: 11, 12L: 10, 37L: 10, 66L: 10, 98L: 10, 100L: 10, 162L: 10, 170L: 10, 175L: 10, 177L: 10, 187L: 10, 191L: 10, 199L: 10, 48L: 9, 155L: 9, 164L: 9, 174L: 9, 10L: 8, 95L: 8, 99L: 8, 168L: 8, 8L: 7, 40L: 7, 57L: 7, 61L: 7, 154L: 6})

the last one is 6 the first one is 29, nearly 5 times difference

antony.trupe
  • 10,640
  • 10
  • 57
  • 84
davyzhang
  • 2,419
  • 3
  • 26
  • 34
  • 6
    so you are saying you need to generate different looking UUIDs because you have some algorithm that does not properly distribute the uuid to different processors? If that is the case maybe you need a better load balancer, not a new uuid generator. – Tim Jul 04 '11 at 04:36
  • 1
    Your bins will be filled/hit as a normal distribution. What you are seeing is exactly what you should expect for normally distributed numbers. As you increase the number of samples this effect will be less pronounced. For 300000 samples I got 1290 of the most common and 1079 fro the least common – John La Rooy Jul 04 '11 at 05:15
  • 2
    This isn't an all bad question. He provides sufficient information to address his incorrect assumptions. With some grammar and formatting foo, this could be a decent question. – antony.trupe Jul 05 '11 at 02:48
  • UIDs aren't guaranteed to be random numbers, just unique. See http://blogs.msdn.com/b/oldnewthing/archive/2008/06/27/8659071.aspx – moonshadow Jul 05 '11 at 09:01

4 Answers4

8

UUIDs are not meant to be random, just unique. If your balancer needs to be keyed off of them, it should run them through a hash function first to get the randomness you want:

import hashlib
actually_random = hashlib.sha1(uuid).digest()
Jeremy
  • 1
  • 85
  • 340
  • 366
  • “UUIDs are not meant to be random, just unique” — this is a good point, but not exactly true of UUID4s: http://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29 . UUID4 is all random except for one half byte which is 4, and one half byte which is one of 8, 9, A or B. – David Wolever Jul 04 '11 at 05:07
  • 6
    @David The point is accurate insofar as what UUIDs are _meant_ for. Random values are just a way of implementing uniqueness (with some degree of certainty). Being well-distributed certainly isn't a goal of UUIDs. – Nick Johnson Jul 04 '11 at 05:16
  • Ah, very true. I got so caught up in the excitement of random numbers I guess I forgot to read… – David Wolever Jul 04 '11 at 05:18
  • thanks you guys give me a lot ways to rethink my question :D, I finally dropped my idea using uuid as balance key – davyzhang Jul 11 '11 at 14:34
6

Your testing methodology doesn't make any sense (see below). But first, this is the implementation of uuid4:

def uuid4():
    """Generate a random UUID."""

    # When the system provides a version-4 UUID generator, use it.
    if _uuid_generate_random:
        _buffer = ctypes.create_string_buffer(16)
        _uuid_generate_random(_buffer)
        return UUID(bytes=_buffer.raw)

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
    except:
        import random
        bytes = [chr(random.randrange(256)) for i in range(16)]
        return UUID(bytes=bytes, version=4)

And the randomness returned by libuuid (the ctypes call), os.urandom and random.randrange should be good enough for most non-crypto stuff.

Edit: Ok, my guess as to why your testing methodology is broken: the number you're counting (divide) is biased in two ways: first, it's the result of dividing by a number which isn't a power of two (in this case, 200), which introduces modulo bias. Second, the if divide == part_count: divide = part_count - 1 introduces more bias.

Additionally, you'll need to figure out what the confidence interval is for any random number generator test before you can interpret the results. My stats-foo isn't great here, though, so I can't really help you with that…

David Wolever
  • 148,955
  • 89
  • 346
  • 502
  • Also, as I note on @j'B's comment, UUID4s have one byte which is not random (one half byte is always `4`, and one half byte is always one of `8`, `9`, `A`, or `B`). – David Wolever Jul 04 '11 at 05:10
  • 1
    Also, for more info on “real” tests random number generators are subjected to, see: http://en.wikipedia.org/wiki/Diehard_tests – David Wolever Jul 04 '11 at 05:12
  • 1
    David: Or [TestU01](http://www.iro.umontreal.ca/~simardr/testu01/tu01.html) which makes many Diehard-passing generators fail as well. – Joey Jul 05 '11 at 12:08
2

Well, UUID is not supposed to be random, it's supposed to be unique : usually, it's based on computer name/ip, date, stuff like that : the goal is not to make it random, the goal is to make sure that two successive calls will provide two different values and that Id from different computers won't collide. If you want more details, you can look at official spec (RFC 4122)

Now, if your load balancer want to use that as a criteria for balancing, I think your design is flawed. If you want a better randomness out of it, you can hash it (like sha-256), thus diluting the little randomness amongst all the bits (that's what a hash is doing)

Community
  • 1
  • 1
Bruce
  • 7,094
  • 1
  • 25
  • 42
1

Only because something doesn't look random, doesn't mean it isn't.

Maybe to the human eye (and mind) some sequences look less random than others, they are not. When you roll a dice 10 times, the probability to roll 2-5-1-3-5-1-3-5-2-6 is as high as rolling 1-1-1-1-1-1-1-1-1-1 or 1-2-3-4-5-6-1-2-3-4. Although the two latter examples seem to be less random, they are not.

Do not try to improve random generators as most probably you will only worsen the output.

For instance: You want to generate a random sequence and it doesn't look random enough to you that one byte appears more frequently than another. Hence you dismiss all sequences with repeated bytes ( or bytes repeated more than n times) in order to assure more randomness. Actually, you are making your sequences less random.

Hyperboreus
  • 31,997
  • 9
  • 47
  • 87
  • “There is not such a thing as more or less random.” this is entirely not true. See, eg: http://en.wikipedia.org/wiki/Diehard_tests – David Wolever Jul 04 '11 at 04:49
  • @David This describes the quality of pseudo random generators. Now please name me one of those generators whose output improves, when you start filtering for "random looking" output. – Hyperboreus Jul 04 '11 at 04:54
  • 1
    My comment wasn't addressing the statements you made about being “random looking” — it was addressing your claim that “there is no such thing as more or less random”. However, you appear to have edited that out, so my comment is no longer valid. – David Wolever Jul 04 '11 at 05:09