
I'm implementing a hashmap to hold all the words in a word file (e.g. dictionary.txt, bible.txt) and I am having a collision problem. I know there are many good hash functions out there, but when I compress the hash code using this compression function, the number of collisions rises significantly (I'm using djb2 as my hash function).

My hashmap converts a key to its hash value and compresses that hash value into an index into the internal hash table, which is an array. It resizes itself to 2 * capacity - 1 when the load factor reaches 0.5. When collisions happen, it generates new indexes using quadratic probing.
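Roughly, the probing step looks like the sketch below (simplified for this post; table, Entry and findSlot are placeholder names rather than my exact code):

// Simplified sketch of the quadratic probing described above: on the
// i-th collision the next index tried is (home + i*i) % capacity.
private int findSlot(int hashCode, Object key) {
    int home = compress(hashCode);
    for (int i = 0; i < capacity; i++) {
        int index = (int) ((home + (long) i * i) % capacity); // long avoids overflow of i*i
        if (table[index] == null || key.equals(table[index].key)) {
            return index;
        }
    }
    return -1; // no free slot found: resize and rehash
}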

This is what my current compress function looks like:

private int compress(int hashCode) {
    return Math.abs(hashCode) % capacity;
}

Is there any (efficient) way to avoid collisions? Changing the structure of the hashmap itself is also acceptable.

Mike Pham
  • There's no good reason to have a lot of collisions, except that your hash algorithm is poor. Please show us more code. Also `Math.abs()` probably doesn't do what you think it does in this case. Read the docs. – markspace Oct 13 '18 at 00:31

2 Answers


I would suggest using a double hashing algorithm; a rough sketch follows the list below.

  • It avoids clustering by giving keys that collide different probe sequences
  • Expected constant-time search/insert
  • It tolerates a higher load factor (α), which lets you use a smaller (more compressed) table
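A minimal sketch of what that could look like, assuming an open-addressing table stored as an Entry[] array with a prime capacity (Entry, table, capacity and findSlot are illustrative names, not from the original post):

// Double hashing sketch: the second hash gives each key its own step
// size, so keys that land on the same home slot follow different
// probe sequences. Capacity should be prime so that every step size
// is co-prime with it and the sequence can visit every slot.
private int findSlot(Object key) {
    int h = key.hashCode() & 0x7fffffff;   // clear the sign bit
    int step = 1 + h % (capacity - 2);     // second hash, never zero
    int index = h % capacity;              // home slot from the first hash
    for (int i = 0; i < capacity; i++) {
        if (table[index] == null || key.equals(table[index].key)) {
            return index;                  // empty slot or matching key
        }
        index = (index + step) % capacity; // jump by the key-specific step
    }
    return -1;                             // table full: resize and retry
}

Note that the question's resize target of 2 * capacity - 1 is not guaranteed to be prime, so with double hashing you would typically grow to the next prime at or above roughly double the old capacity.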
bsheps

Your "compression" of the hashcode is turning a relatively good hash function into a poor one.

There is basically only one practical solution to this: stop doing it. Just use the full 32-bit hash codes. They are not compressible; anything you do to reduce their size will inevitably increase the collision rate.


The problem of mapping 32-bit hash codes onto array indexes is a different one. For that, you should use hashcode % array.length.
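As a rough sketch of that mapping (illustrative only, reusing the compress/capacity names from the question), one common approach is to clear the sign bit before taking the remainder, which also sidesteps the Math.abs(Integer.MIN_VALUE) pitfall mentioned in the comments:

// Sketch of mapping a full 32-bit hash code to an array index.
// (hashCode & 0x7fffffff) clears the sign bit, so the remainder is
// never negative -- unlike Math.abs(), which still returns a negative
// value for Integer.MIN_VALUE.
private int compress(int hashCode) {
    return (hashCode & 0x7fffffff) % capacity;
}

If you keep the capacity a power of two instead, a bit mask such as (hashCode ^ (hashCode >>> 16)) & (capacity - 1) works as well and is roughly what java.util.HashMap does; the extra XOR mixes the high bits into the low bits that the mask keeps.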

If that is giving you an excessive collision rate then either your original hash function is poor, or there is some other bug or design problem in your implementation, or ... you just got unlucky.

But it could also be a problem with the way you are gathering your stats on collisions, or a problem with your expectations.


It is also worth noting that you are using an open addressing scheme. The Wikipedia article says this:

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

In fact, if you think about it, the effects of collisions in any open addressing scheme are more pronounced than when you use separate hash chains.
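To illustrate the alternative (a sketch only, not part of the original answer; ChainedTable, Node and buckets are made-up names), separate chaining keeps colliding entries in a per-bucket linked list, so a collision only lengthens one chain instead of occupying other slots:

// Minimal sketch of separate chaining: each array slot holds a linked
// list of entries, and colliding keys simply share a bucket.
class ChainedTable<K, V> {
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[16];

    public void put(K key, V value) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {    // key already present: overwrite
                n.value = value;
                return;
            }
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // prepend to the chain
        // (growing the bucket array as the load factor rises is omitted here)
    }
}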


Finally, implementing performant hash tables from scratch is difficult, especially if you don't first read up on the literature on the subject. (Asking on StackOverflow is NOT a good way to do your research!)

Stephen C
  • But then I'm using an array for a hash table. How can I fit the 32 bit hashcode to use it as an index of a small size array? – Mike Pham Oct 13 '18 at 01:39