4

I'm following this formula from Wikipedia:

H(i, k) = (H1(k) + i*H2(k)) % size

and my H1 is Python's built-in hash() function.

H2 is:

PRIME - (H1(k) % PRIME)

Unfortunately it randomly gets stuck in an infinite loop after a couple of insertions: it cannot traverse all the slots in my table.

Here is my code, but you have to set PYTHONHASHSEED=12 in order to reproduce the bug. (I deliberately removed many details to keep the implementation minimal.)

EMPTY = object()

class DoubleHashingHashMap:
    def __init__(self):
        self.prime = 7
        self.size = 15
        self.slots = [EMPTY] * self.size

    def __setitem__(self, key, value):
        for idx in self.probing_sequence(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                self.slots[idx] = (key, value)
                break
            elif isinstance(slot, tuple):
                k, v = slot
                if k == key:
                    self.slots[idx] = (key, value)
                    break

    def probing_sequence(self, key):
        h1 = self.hash_func1(key) % self.size
        h2 = self.hash_func2(key) % self.size
        i = 1
        while True:
            yield (h1 + i*h2) % self.size
            i += 1

    def hash_func1(self, item):
        return hash(item)

    def hash_func2(self, item):
        return self.prime - (self.hash_func1(item) % self.prime)

hashmap = DoubleHashingHashMap()
for i in range(8):
    hashmap[str(i)] = i
print("8 items added.")
print("Going into the infinite loop when adding the 9th item (which is 8)...")
hashmap["8"] = 8
print("This line can't be reached.")

I would appreciate it if you could tell me what's wrong with my math.

S.B
  • When you used a debugger to see *why* it gets stuck, what did you find? – Scott Hunter Jun 28 '23 at 13:20
  • 1
    Just staring down the code, this would be the case when `h2` is a multiple of `self.size`. Then the `while True` loop keeps yielding the same value over and over and the consuming loop in `__setitem__` has no else part, so may never break. What if `k != key`? – user2390182 Jun 28 '23 at 13:23
  • @user2390182 if `k != key`, that slot is occupied by another key with (most likely) different hash that happens to be in the same as the current key. The code continues to the next iteration. – S.B Jun 28 '23 at 13:31
  • @S.B Yeah, but when h2 is multiple of size, every future iteration will probe the same index. – user2390182 Jun 28 '23 at 13:32
  • @user2390182 You're right. How can I fix that? Does that mean my `h2` is entirely incorrect? – S.B Jun 28 '23 at 13:33
  • Yeah, you should ensure (by doing some math =) ) that this cannot be the case. Ideally your probing sequence should be able to reach all indices, so h2 and size should be coprime. Or you can change the probing sequence to not use these linear increments. – user2390182 Jun 28 '23 at 13:36
  • @ScottHunter Sorry I didn't respond. The behavior is just like what user2390182 explained. – S.B Jun 28 '23 at 13:36
  • @user2390182 Interesting. Thanks for your explanation. Actually I got the implementation of `H2` from many online websites. They all proposed this but apparently it doesn't work. If you come across a valid `H2` implementation please send that reference here. – S.B Jun 28 '23 at 14:20
  • It may just be a different choice of size. 15 is quite small. – user2390182 Jun 28 '23 at 14:23
  • @user2390182 In my actual code, I used 64 as the initial value and I resize the table when it reaches the load factor. I just removed these details here for simplicity. The problem occurs even when the initial size is 64. – S.B Jun 28 '23 at 14:25

1 Answer

2

The logic calculating the sequence is flawed. For the configuration you mentioned, it will output 0, 5, 10 forever; since slots 0, 5 and 10 are already occupied, the loop never terminates. You only multiply h2 by i and take the result modulo the size, so the sequence cycles through a few specific values and doesn't cover all possible indexes.

This is what happens in your case:

import numpy as np

# h1 = 10, h2 = 5; the first 10 outputs of the probing sequence are
print((10 + np.arange(10) * 5) % 15)
# [10  0  5 10  0  5 10  0  5 10]

So the sequence actually cycles through only 3 of the 15 possible values, which is probably why the bug shows up so quickly.
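More generally (this is a fact about modular arithmetic, not code from the question): the sequence (h1 + i*h2) % size visits exactly size // gcd(h2, size) distinct slots before repeating, which you can check with the h1=10, h2=5, size=15 values from above:

```python
from math import gcd

h1, h2, size = 10, 5, 15

# collect every slot the probing sequence can ever reach
visited = {(h1 + i * h2) % size for i in range(size)}
print(sorted(visited))        # [0, 5, 10] -- the 3-slot cycle
print(size // gcd(h2, size))  # 3 distinct slots out of 15
```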

With your current implementation you can simply fall back to linear probing: increase the index by one until a slot is empty, and in __getitem__ check whether the key in the slot matches the requested key, advancing by one until you find it.

EMPTY = object()


class DoubleHashingHashMap:
    def __init__(self):
        self.prime = 7
        self.size = 15
        self.slots = [EMPTY] * self.size

    def __setitem__(self, key, value):
        for idx in self.probing_sequence(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                self.slots[idx] = (key, value)
                break
            elif isinstance(slot, tuple):
                k, v = slot
                if k == key:
                    self.slots[idx] = (key, value)
                    break

    def __getitem__(self, key):
        for idx in self.probing_sequence(key):
            slot = self.slots[idx]
            if slot is not EMPTY and slot[0] == key:
                return slot[1]

    def probing_sequence(self, key):
        h1 = self.hash_func1(key) % self.size
        h2 = self.hash_func2(key) % self.size
        i = 0
        while True:
            yield (h1 + h2 + i) % self.size
            i += 1

    def hash_func1(self, item):
        return hash(item)

    def hash_func2(self, item):
        return self.prime - (self.hash_func1(item) % self.prime)


hashmap = DoubleHashingHashMap()
for i in range(8):
    hashmap[str(i)] = i
print("8 items added.")
print("Going into the infinite loop when adding the 9th item (which is 8)...")
hashmap["8"] = 8
print("This line can't be reached.")
print(hashmap["1"], hashmap["8"])

So this fixes it, but probably not in the way you want, since you reference the Wikipedia article.

So why does the formula from Wikipedia not work in your case? Probably because your h2 does not have all the needed characteristics.

The Wikipedia article you linked says:

The secondary hash function h2(k) should have several characteristics:

  • it should never yield an index of zero

  • it should be pair-wise independent of h1(k)

  • it should cycle through the whole table

  • all h2(k) should be relatively prime to the table size

Your h2 actually has only the first characteristic: it can't be 0. It is definitely dependent on h1, since you use h1 to calculate h2. It won't cycle through the whole table, since your self.prime < self.size. And it can output e.g. 5, which is not relatively prime to a table size of 15; they share the factor 5.

As said in the article, to get the relatively-prime characteristic you can, for example, make the table size a power of 2 and only ever return odd numbers from h2; any odd number is automatically relatively prime to a power of 2. You should also not use h1 to calculate h2, so that they are independent, and make sure the outputs of h2 are in the interval [1, size - 1].
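A minimal sketch of that idea (illustrative only, not code from the answer): with a power-of-two table size, forcing h2 to return an odd number guarantees the step is relatively prime to the size, so the probing sequence reaches every slot:

```python
from math import gcd

SIZE = 16  # power-of-two table size (illustrative value)

def hash_func2(item):
    # OR-ing with 1 forces an odd step; any odd number is relatively
    # prime to a power-of-two size. (Deriving the step from hash() here
    # is just a placeholder -- ideally h2 would be independent of h1.)
    return (hash(item) % SIZE) | 1

# every odd step visits all SIZE slots before repeating
for step in range(1, SIZE, 2):
    assert gcd(step, SIZE) == 1
    assert len({(i * step) % SIZE for i in range(SIZE)}) == SIZE
```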

So if you want to apply the double hashing formula, you need to make sure your h2 actually has the required characteristics. Otherwise the probing sequence collapses into a short cycle of a few indexes, as you observed.

Nopileos
  • Thanks for the explanation. You're right, my `H2` function was not appropriate for that formula. With that simple change it now works perfectly. – S.B Jun 28 '23 at 17:37
  • 1
    Note that I did not change anything according to the Wikipedia article, so my code is not what is described there. It was just a quick fix you can normally do when implementing a hashmap, but it is not very efficient if you have a lot of collisions. Double hashing is better, since you step a pseudo-random amount forward when you encounter a collision, which leads to fewer collisions overall. Also, I think your `i` should initially be 0, since double hashing is for resolving hash conflicts, so on the first iteration you should just try to insert at the index h1 gives you. – Nopileos Jun 28 '23 at 21:10