
When a node joins a DHT network, it seems optimal for the new node to evenly divide the largest interval on the consistent hash's circle in order to minimize remapping. However, this is only optimal when the node count is a power of two (assuming we start with n=1); every other count creates hotspots if the keys are uniformly accessed, as the sketch after the list illustrates:

  • n=2: 1/2, 1/2 (optimal)
  • n=3: 1/4, 1/4, 1/2 (hotspot: 1/3 of the nodes serves 1/2 of the traffic)
  • n=4: 1/4, 1/4, 1/4, 1/4 (optimal)
  • n=5: 1/8, 1/8, 1/4, 1/4, 1/4 (hotspot: 3/5 of the nodes serve 3/4 of the traffic)
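
For reference, a quick sketch that reproduces those fractions by always halving the current largest interval:

// Simulate the "split the largest interval" join strategy and print each
// node's share of the keyspace for n = 2..5.
function splitLargestIntervals(nodeCount) {
  var intervals = [1.0]; // one node owns the whole circle
  while (intervals.length < nodeCount) {
    var largest = intervals.indexOf(Math.max.apply(null, intervals));
    var half = intervals[largest] / 2;
    intervals.splice(largest, 1, half, half); // replace it with two halves
  }
  return intervals;
}

for (var n = 2; n <= 5; n++) console.log(n, splitLargestIntervals(n));
// 2 [0.5, 0.5]
// 3 [0.25, 0.25, 0.5]
// 4 [0.25, 0.25, 0.25, 0.25]
// 5 [0.125, 0.125, 0.25, 0.25, 0.25]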

An approach that minimizes hotspots, at the cost of more remapping, is to redistribute the nodes evenly around the circle each time one joins:

  • n=2: 1/2, 1/2
  • n=3: 1/3, 1/3, 1/3

With an implementation like the one below, fairly few keys are remapped (I'm not sure it's actually the minimum), hotspots are eliminated, and the basic consistent-hashing algorithm is preserved.

// 10 perfectly distributed hash keys, referred to below as a-j
var hashKeys = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95];

for (var kNodeCount = 1; kNodeCount < 5; kNodeCount++) {
  var buckets = [];
  for (var k = 0; k < kNodeCount; k++) buckets[k] = [];
  // Assign each key to the bucket (node) that owns its evenly sized slice of [0, 1):
  for (var i = 0; i < hashKeys.length; i++) {
    var hashKey = hashKeys[i];
    var bucketIndex = Math.floor(hashKey * kNodeCount);
    buckets[bucketIndex].push(hashKey);
  }
  console.log(kNodeCount, buckets);
}

The transitions this produces (letters a-j instead of numbers) are:
[abcdefghij] -> [abcde][fghij] -> [abc][defg][hij] -> [ab][cde][fg][hij]
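
To see how much remapping that costs, here's a small sketch that counts the keys whose bucket changes at each step. It treats bucket i in the n-node layout as the same node as bucket i in the (n+1)-node layout, which is how the transitions above are drawn:

// Reuses hashKeys from the snippet above.
function bucketOf(key, nodeCount) {
  return Math.floor(key * nodeCount);
}

for (var n = 1; n < 4; n++) {
  var moved = hashKeys.filter(function (key) {
    return bucketOf(key, n) !== bucketOf(key, n + 1);
  }).length;
  console.log(n + ' -> ' + (n + 1) + ': ' + moved + '/' + hashKeys.length + ' keys remapped');
}
// 1 -> 2: 5/10 keys remapped
// 2 -> 3: 5/10 keys remapped
// 3 -> 4: 6/10 keys remapped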

Are there other or better solutions to this (is this a solved problem)? I'm relatively new to DHTs and distributed algorithms in general, but I haven't seen this addressed in anything I've read about DHTs, p2p, or distributed algorithms. In my specific scenario, minimizing hotspots is critical, while remapping keys is comparatively cheap.

ZachB

1 Answer


Notice that as n grows, the difference in load between the hotspot nodes and the optimally loaded nodes shrinks. The common solution is therefore to introduce a large number of virtual nodes (artificially inflating n) and have each real node host several of them, which spreads the data more evenly.

It's common practice in industry; Riak and Cassandra, for example, both use virtual nodes.
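
Here is a minimal sketch of a ring with virtual nodes; the vnode count and the toy hash are placeholders, not what Riak or Cassandra actually use:

var VNODES_PER_NODE = 64; // more vnodes -> smoother distribution, bigger ring

// Toy 32-bit FNV-1a hash mapped into [0, 1); any uniform hash would do.
function toyHash(str) {
  var h = 2166136261;
  for (var i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 16777619) >>> 0;
  }
  return h / 4294967296;
}

// Place VNODES_PER_NODE points on the circle for every real node.
function buildRing(nodes) {
  var ring = [];
  nodes.forEach(function (node) {
    for (var v = 0; v < VNODES_PER_NODE; v++) {
      ring.push({ point: toyHash(node + '#' + v), node: node });
    }
  });
  ring.sort(function (a, b) { return a.point - b.point; });
  return ring;
}

// A key is owned by the first virtual node clockwise from its hash.
function lookup(ring, key) {
  var h = toyHash(key);
  for (var i = 0; i < ring.length; i++) {
    if (ring[i].point >= h) return ring[i].node;
  }
  return ring[0].node; // wrapped around the circle
}

var ring = buildRing(['node-a', 'node-b', 'node-c']);
console.log(lookup(ring, 'some-key'));

Because each real node owns many small, scattered slices instead of one contiguous arc, a node that joins or leaves exchanges load with all the other nodes roughly equally, so hotspots and remapping both stay bounded without the full redistribution step from the question.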

rystsov