
I'm using the hash_ring package to distribute objects among servers. I assumed the distribution would be uniform, since it's based on MD5 hashes. Unfortunately, that's not the case.

I'm using random keys generated with uuid.uuid4(). I've verified that MD5 itself does in fact give a uniform distribution. However, when I distribute keys using hash_ring.HashRing, there is a difference of 20-30% between the most and least populated buckets.

  • Can uniformity of distribution be improved in hash_ring by tweaking some settings?
  • Are there other good alternatives to do consistent hashing in Python?

The code I've used to test uniformity of distribution:

import uuid
import hash_ring

ring = hash_ring.HashRing(range(8))
for _ in range(10):
    counters = [0] * 8
    for _ in range(100000):
        counters[ring.get_node(str(uuid.uuid4()))] += 1
    print counters

Which printed out:

[11115, 11853, 14033, 11505, 13640, 12292, 12851, 12711]
[11164, 11833, 14024, 11562, 13365, 12302, 13002, 12748]
[11354, 11756, 14017, 11583, 13201, 12231, 13135, 12723]
[11182, 11672, 13936, 11441, 13563, 12240, 13129, 12837]
[11069, 11866, 14045, 11541, 13443, 12249, 12894, 12893]
[11196, 11791, 14158, 11533, 13517, 12319, 13039, 12447]
[11297, 11944, 14114, 11536, 13154, 12289, 12990, 12676]
[11250, 11770, 14145, 11657, 13340, 11960, 13161, 12717]
[11027, 11891, 14099, 11615, 13320, 12336, 12891, 12821]
[11148, 11677, 13965, 11557, 13503, 12309, 13123, 12718]

For comparison, I did non-consistent hashing directly with MD5:

import hashlib
import uuid

for _ in range(10):
    counters = [0] * 8
    for _ in range(100000):
        counters[int(hashlib.md5(str(uuid.uuid4())).hexdigest(), 16) % 8] += 1
    print counters

with much better results:

[12450, 12501, 12380, 12643, 12446, 12444, 12506, 12630]
[12579, 12667, 12523, 12385, 12386, 12445, 12702, 12313]
[12624, 12449, 12451, 12401, 12580, 12449, 12562, 12484]
[12359, 12543, 12371, 12659, 12508, 12416, 12619, 12525]
[12425, 12526, 12565, 12732, 12381, 12481, 12335, 12555]
[12514, 12576, 12528, 12294, 12658, 12319, 12518, 12593]
[12500, 12471, 12460, 12502, 12637, 12393, 12442, 12595]
[12583, 12418, 12428, 12311, 12581, 12780, 12372, 12527]
[12395, 12569, 12544, 12319, 12607, 12488, 12424, 12654]
[12480, 12423, 12492, 12433, 12427, 12502, 12635, 12608]
vartec
  • I think you should contact the author of the package. I read over the blog http://www.lexemetech.com/2007/11/consistent-hashing.html and think it is not immediately obvious why you noticed this behaviour. Maybe it is just a trade-off between the good properties of consistent hashing and the resulting uniformity of the distribution. – rocksportrocker Aug 22 '11 at 17:56
  • it is expected and the replicas parameter specifically addresses this. – andrew cooke Aug 22 '11 at 18:27
  • I'm using the latest version, which is this one: https://github.com/amix/hash_ring/blob/master/hash_ring/hash_ring.py Number of replicas seems to be hard-coded :-( – vartec Aug 22 '11 at 23:28
  • sorry, i was wrong - i was looking at the old code. for the new code i do not know if this is as good as it gets or not. i agree with @rocksportrocker that it may be worth asking if this is normal (looking at the blog, i think it may be). – andrew cooke Aug 23 '11 at 03:10

1 Answer


the hash ring sacrifices the "evenness" of your md5 test code to maintain mappings when the number of entries changes. see http://www.lexemetech.com/2007/11/consistent-hashing.html. so the differences you see are not because of uuid4, or because of an error, but because the library uses a different algorithm from your test.

if you changed the number of bins in your md5 code you'd need to change the modular division (the % 8) and suddenly (almost) all mappings would change. consistent hashing avoids this. that is why it can't use the same "obviously uniform" approach you do.
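to make that concrete, here's a quick sketch (python 3; the `md5_bin` helper is made up for illustration, not part of hash_ring) that counts how many keys change bins when the bin count goes from 8 to 9 under the plain modulo scheme:

```python
import hashlib


def md5_bin(key, n_bins):
    # Non-consistent scheme: MD5 of the key, taken modulo the bin count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_bins


keys = ["key-%d" % i for i in range(10000)]
moved = sum(1 for k in keys if md5_bin(k, 8) != md5_bin(k, 9))
print("fraction remapped: %.3f" % (moved / float(len(keys))))  # roughly 8/9
```

with `hash % n`, roughly 8/9 of keys land in a different bin after adding a ninth; a consistent hash ring would instead remap only about the 1/9 of keys the new node takes over.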

the downside of the consistency approach is that it's not perfectly uniform (it depends on the hashes of the bins, rather than on the hashes of the objects you put in the bins, so you don't get the "evening out" you would expect as you add more objects). but you can increase uniformity by increasing the number of points on the ring - by increasing the number of "replicas" used in the code.
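to illustrate the replicas trade-off, here is a toy ring (python 3; a minimal sketch, not hash_ring's actual implementation - `SimpleRing` and `spread` are invented names) showing that more virtual points per node flattens the distribution:

```python
import bisect
import hashlib
import uuid


class SimpleRing(object):
    """Toy consistent-hash ring: each node gets `replicas` virtual points."""

    def __init__(self, nodes, replicas):
        self.points = sorted(
            (int(hashlib.md5(("%s:%d" % (node, i)).encode()).hexdigest(), 16), node)
            for node in nodes
            for i in range(replicas)
        )
        self.hashes = [h for h, _ in self.points]

    def get_node(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # First virtual point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.hashes, h) % len(self.hashes)
        return self.points[idx][1]


def spread(replicas, n_keys=20000):
    # Difference between the most and least populated of 8 buckets.
    ring = SimpleRing(range(8), replicas)
    counters = [0] * 8
    for _ in range(n_keys):
        counters[ring.get_node(str(uuid.uuid4()))] += 1
    return max(counters) - min(counters)


print(spread(3), spread(100))  # the spread typically shrinks as replicas grow
```

with only 3 virtual points per node the bucket sizes are dominated by where those few points happen to fall; with 100 the arcs average out and the counts get much closer to uniform.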

so assuming that the final api matches that shown at http://amix.dk/blog/viewEntry/19367 just set a larger value for replicas (try doubling it, or even just adding 1 - you're already pretty flat).


update: i looked around a bit more and this blog post http://amix.dk/blog/post/19369 describes the changes made to the latest code. it looks like that does use more replicas than 3 (i don't completely understand the code, sorry) and it also seems that it's now based on a well-known "standard" implementation.

andrew cooke
  • I'm accepting this as an answer, as it indeed seems that increasing the number of replicas is the way to go for me. As for the `hash_ring` API, unfortunately in the latest version the number of replicas is hard-coded (exactly 3). – vartec Aug 22 '11 at 23:26
  • seems I've misunderstood the code a little; actually the number of replicas is `math.floor((40*len(self.nodes)*weight) / total_weight)`, where the weight of a server is `1` by default. – vartec Aug 23 '11 at 08:44
  • How to choose the value of `replicas` ? – Jerry Chin Oct 30 '19 at 06:32
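As a sanity check of the formula quoted in the comments above (the `replica_count` helper is just for illustration, not part of hash_ring's API): with 8 equally weighted nodes it works out to 40 virtual points per node.

```python
import math


def replica_count(n_nodes, weight=1, total_weight=None):
    # Formula quoted in the comments: floor((40 * len(nodes) * weight) / total_weight).
    if total_weight is None:
        total_weight = n_nodes * weight
    return int(math.floor((40 * n_nodes * weight) / float(total_weight)))


print(replica_count(8))  # -> 40
```

so with equal weights every node gets 40 points regardless of cluster size; a node with twice the default weight would get proportionally more of the ring.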