0

My understanding of a distributed hash table is that every node can be identified uniquely by a nodeId and can store information, like host, port and values. Every node stores other nodeIds in (a) lookup table(s) and finding another node can be made as efficient as log(n) with system size n. In order to retrieve the value from a node, one would need a key. Is the key for a value just the nodeId (i.e. a content identifier or hash of the value)? If so, then every node can only save one value? Or can a nodeId store multiple key-values, in which case the question arises how to retrieve a value without knowing which node contains which keys.

Marlo
  • 207
  • 1
  • 4
  • 12
  • 1
    Nodes store data for keys near them, other nodes will send them a STOR message for values near their node ID. The node ID itself is a random value. – Anon Coward Mar 07 '22 at 19:37
  • I see, thanks. Is there any ref that spcifies which nodes are eligible to store a value for a given key? Is an XOR distance of 4 okay? Maybe in a smaller networks the closest node has a distance of 80 bits. As it grows the value is still stored in this far away node. A lookup will not only take very long but might not converge, since a completely wrong corner of the network is explored. – Marlo Mar 07 '22 at 23:12
  • 1
    Any node is eligible to store any key. A node is meant to instruct the closest node that it finds when looking up a key that didn't know the key what the key/value is, so in a well behaving well functioning network, the closest node to a key will always know that key, along with at least one other backup node. This is all laid out in the [Kademlia white paper](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf) – Anon Coward Mar 07 '22 at 23:38

1 Answers1

0

There are two general approaches for building distributed hash tables: distribute keys to nodes based on key values or based on hashes of keys. There are pros and cons for either option.

In either way, we will need to find a nodeId for a given key. A naive implementation could use modulus operation to find a target node. The problem with this approach is that when we do change number of nodes, then we have to remap absolutely every key - this is slow.

Let's say we have 4 nodes and the key is 10. 10 mod 4 is 2 an we will pick node #2 to store the data.

If you do need to handle node addition/removal efficiently, the common approach is called Consistent Hashing - that works similar to what you have described - we have multiple key (or hashes of key) intervals and assign many of those intervals to nodes - we require number of this intervals to be much larger than number of nodes. Hence if we add or remove a node, only some of those intervals need to be remapped to other nodes.

AndrewR
  • 1,252
  • 8
  • 7