
Circular hashing algorithms provide consistency given a static set of targets. For instance:

  1. I have an initial set of targets, let's call them A, B and C.
  2. I have a key, let's call it x
  3. I have a circular hashing function, let's call it hash(key, targets)
  4. When I call hash(x, [A,B,C]), x always hashes to A

Seems obvious enough. The fact that I always get A given x represents the consistency I expect when using circular hashes. However, let's now consider what happens if I add a new node D:

  1. My target set is rebalanced to include A, B, C, and D
  2. I reapply my key x to hash(x, [A,B,C,D])
  3. Because the circle is rebalanced, I am not guaranteed to get A anymore

Am I missing something or am I just out of luck? The problem is further exacerbated when you start reordering the nodes (e.g. hash(x, [B,A,D,C])) or if you insert a new node in the middle of an existing node list (e.g. hash(x, [A,AA,B,C,D])). I've looked a bit into the academic side of circular hashing and this type of "scaling consistency" doesn't seem to be one of its primary concerns. Maybe I'm just using the wrong type of hashing algorithm?
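
To make the problem concrete, here is a minimal Python sketch of the naive scheme I have in mind (the modulo-over-an-ordered-list approach and the md5 hash are just illustrative assumptions, not any particular library):

```python
import hashlib

def hash_key(key, targets):
    # Naive "circular" hash: even arcs over an ordered target list.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return targets[digest % len(targets)]

print(hash_key("x", ["A", "B", "C"]))       # say this prints "A"
print(hash_key("x", ["A", "B", "C", "D"]))  # no guarantee this is still "A"
```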

  • Sorry, I could not find much material on circular hashing. Do you mean consistent hashing and DHT? – Nayuki Mar 08 '13 at 03:35
  • Your answer aptly explains what I understand as "circular hashing". However, I don't think I'm any closer to a real solution. It's easy to solve this problem with a notion of "memory", i.e. remembering what past keys hashed where. However, it might just be impossible to build the memory into the algorithm. The algorithm has to mutate after a key is hashed. Adding new targets should not affect where past keys were hashed. – Jonathan Azoff Mar 08 '13 at 05:57
  • "Because the circle is rebalanced, I am not guaranteed to get A anymore" So you're saying that for already-hashed items, they should never be distributed to new nodes? / "The problem is further exacerbated when you start reordering the nodes" But in the model that I presented in the answer, there is no such thing as reordering the nodes because the set of nodes is unordered. Can you address these two concerns? – Nayuki Mar 08 '13 at 14:28
  • Sure. So for your question about already hashed items, yes they should never be hashed to new nodes. In other words, once a key is hashed to a node, it should always hash to that node. For your second question, you are right that order does not affect your algorithm. However, in my attempts to make this work, I used an ordered list and even distribution around the circle. Hence, to keep the arcs even when a new node is added, the existing nodes all have to shift (i.e. "rebalancing"). Hopefully, that adds more clarity, and not less :) – Jonathan Azoff Mar 08 '13 at 18:02

4 Answers


There is quite a simple solution to your problem. Here is an example of how it works.

Let's assume you have 3 real targets (i.e. physical machines): A, B, C. Then you introduce 9 virtual targets: 1, 2, 3, 4, 5, 6, 7, 8, 9 and establish a static mapping from virtual targets to real targets like this:

1, 2, 3 -> A
4, 5, 6 -> B
7, 8, 9 -> C

When you need to read or write a value for some key, you first map the key to a virtual target using the hash function, and then map that virtual target to a real target using the static mapping shown above. Since a real target may serve several virtual targets, it should store their data in separate hash maps, so real target B has three separate hash maps for the three virtual targets it serves.

Now we want to add a new real target: D. We first rebalance our static mapping, e.g. like this:

1, 2, 3 -> A
4, 5 -> B
7, 8 -> C
6, 9 -> D

Then we transfer the hash map that serves virtual target 6 from real target B to the new real target D, and likewise the map serving virtual target 9 from C to D. This operation has complexity O(n), where n is the number of values transferred, because each real target keeps each virtual target it serves in a separate hash map.

For good load balancing, the number of virtual targets should be several times (e.g. 10 times) greater than the estimated maximum number of real targets.

In other words, the main idea of the solution is that the hash function maps a key to a virtual target, and the number of virtual targets never changes. A static mapping then maps each virtual target to a real target, and only this static mapping changes when real targets are added or removed.
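
Here is a minimal Python sketch of the two-level lookup (the md5 hash and the 9-virtual-target layout are just illustrative assumptions matching the example above):

```python
import hashlib

NUM_VIRTUAL = 9  # fixed for the lifetime of the system

def virtual_target(key):
    # Map a key to one of the fixed virtual targets 1..9.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_VIRTUAL + 1

# Static mapping before D joins: 1,2,3 -> A; 4,5,6 -> B; 7,8,9 -> C
virtual_to_real = {1: "A", 2: "A", 3: "A",
                   4: "B", 5: "B", 6: "B",
                   7: "C", 8: "C", 9: "C"}

def real_target(key):
    return virtual_to_real[virtual_target(key)]

# When D joins, only the static mapping changes; the hash maps behind
# virtual targets 6 and 9 are transferred from B and C to D.
virtual_to_real[6] = "D"
virtual_to_real[9] = "D"
```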

Mikhail Vladimirov
  • I don't think this will work. Here's why: Using your first example, assume `x` hashes to virtual target `6` which hashes to real target `C`. Now let's rebalance, as in your second example. After the rebalance, `x` still maps to virtual target `6`, which is - in fact - consistent. However, virtual target `6` does not map to real target `C` any longer. While I may have achieved better distribution, I have failed at maintaining the correlation between `x` and real target `C`. Keeping the correlation from the key to the real target, as we add new targets, is my goal here; not better distribution. – Jonathan Azoff Mar 08 '13 at 20:24
  • As I mentioned before, rebalancing may require moving virtual targets between real targets, which means transferring whole hash maps with all the data from one real target to another. – Mikhail Vladimirov Mar 08 '13 at 21:02

As you expand the allowable output range of the hash function, it stands to reason that some inputs will then hash to different outputs (otherwise there was no point expanding the range). The only way it can be otherwise is if the hash function stores all previous results (or a compressed, possibly lossy form of the same, like a Bloom filter), so that it can remember to use the "old" result for inputs it has seen before.
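
As a rough illustration (purely hypothetical names), a "remembering" hash would look something like this:

```python
import hashlib

seen = {}  # key -> target chosen the first time the key was hashed

def sticky_hash(key, targets):
    # Remember previous results so already-seen keys keep their old target.
    if key not in seen:
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        seen[key] = targets[digest % len(targets)]
    return seen[key]
```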

John Zwinck

I couldn't interpret your entire question in a consistent way, so I will guess what you really wanted to ask and answer based on that.

Assumed problem: You have a bunch of objects (e.g. strings) and you have a bunch of machines, and you want to assign each object to a machine in order to spread the workload among the machines. When a machine joins or leaves the pool of machines, you don't want to reshuffle too many of the object-to-machine assignments ("scaling consistency").

I think you have a misunderstanding where you said you hash the object x to map it to a machine in the pool [A,B,C]. My understanding is that there are three intermediate steps involved.

  1. Calculate a hash value for each object. Suppose that the hash output space is something large, like all integers from 0 to 2^32 − 1.

  2. Assign a value (in the same number space) to each machine, which it keeps constant for its lifetime. You would want to spread these numbers out randomly.

  3. Now, we assign each object to belong to the closest upward machine. That means if the object's hash is x, then it belongs to the machine M such that M's value is the smallest number greater than x, wrapping around to the machine with the smallest value if no machine's value is greater. (A small code sketch of this lookup follows the example below.)

Example:

  1. We have 4 string objects with their respective hash in the range 0 to 999: abc=314, def=125, ghi=802, jkl=001.

  2. We have 3 machines, with these numbers: X=010, Y=357, Z=768.

  3. Which machine does object abc=314 belong to? Counting upwards, the closest machine is Y=357.
    Which machine does object ghi=802 belong to? Counting upwards and wrapping around past the top of the range, the closest machine is X=010.
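
A small Python sketch of the lookup in step 3, using the toy numbers from the example (the 0–999 space and the bisect-based search are illustrative choices, not part of any particular library):

```python
import bisect

# Machine values from the example above; the 0..999 space is illustrative.
ring = sorted([(10, "X"), (357, "Y"), (768, "Z")])
positions = [pos for pos, _ in ring]

def owner(obj_hash):
    # Smallest machine value greater than the object's hash, wrapping around.
    i = bisect.bisect_right(positions, obj_hash)
    return ring[i % len(ring)][1]

print(owner(314))  # Y (abc)
print(owner(125))  # Y (def)
print(owner(802))  # X (ghi, wraps around)
print(owner(1))    # X (jkl)
```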

Nayuki
  • There is nothing about this response that is wrong. Your assumptions about my meaning are completely correct. However, I'm not sure if this actually answers my question. – Jonathan Azoff Mar 08 '13 at 06:22

Ok, I think I got it.

I ended up keeping the hashing algorithm simple, and using a "checksum" (of sorts) to ensure that x always keys to the same target. When a new target is added, and the system rebalances, I simply inform all the existing targets about the rebalance. This way, if x hashes to a target it should no longer hash to, the target can then just delegate to the correct one.
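
Loosely sketched in Python (all names here are hypothetical; this is just one way to express the delegation idea, not the exact code):

```python
import hashlib

def naive_target(key, targets):
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return targets[digest % len(targets)]

# Hypothetical per-target delegation table, populated when the system
# rebalances: "this key no longer belongs here; go ask that target instead."
delegations = {"A": {}, "B": {}, "C": {}, "D": {}}

def lookup(key, targets):
    landed_on = naive_target(key, targets)
    # If the target we landed on knows the key has moved, follow the delegation.
    return delegations[landed_on].get(key, landed_on)
```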

Thank you for all the replies, I might not have come to this solution were it not for the clarity that you all provided.

Cheers,

Jon