I am implementing my own specialized hashmap which has generic value types, but keys are always of type long. Here and there I see people suggesting that I should multiply the key by a prime and then take the modulo by the number of buckets:
int bucket = (key * prime) % numOfBuckets;
and I don't understand why. It seems to me that it has exactly the same distribution as the simple:
int bucket = key % numOfBuckets;
For example, if numOfBuckets is 8, with the second "algorithm" we get buckets {0, 1, 2, 3, 4, 5, 6, 7} repeating for key = 0 to infinity. With the first algorithm, for the same keys, we get buckets {0, 3, 6, 1, 4, 7, 2, 5} (or similar), also repeating. Basically we have the same problem as when using an identity hash.
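To make this concrete, here is a small sketch that prints both sequences side by side (prime = 3 is an arbitrary choice for illustration; any odd prime gives a similar repeating permutation when numOfBuckets = 8):

```java
public class BucketDistribution {
    public static void main(String[] args) {
        int numOfBuckets = 8;
        long prime = 3; // arbitrary prime, just for illustration

        StringBuilder plain = new StringBuilder();
        StringBuilder multiplied = new StringBuilder();
        for (long key = 0; key < 16; key++) {
            // simple modulo: cycles 0..7 in order
            plain.append((int) (key % numOfBuckets)).append(' ');
            // prime multiplication: cycles through a fixed permutation of 0..7
            multiplied.append((int) ((key * prime) % numOfBuckets)).append(' ');
        }
        System.out.println("key % numOfBuckets:           " + plain);
        System.out.println("(key * prime) % numOfBuckets: " + multiplied);
    }
}
```

Running this shows the second line is just a reshuffled version of the first, repeating with the same period of 8.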
Basically, in both cases we get collisions for keys:
key = x + k*numOfBuckets (for k = 1 to infinity; and x = key % numOfBuckets)
because when we take the modulo by numOfBuckets we always get x. So what's the deal with the first algorithm? Can someone enlighten me?
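My claim that both schemes produce identical collision sets can be checked empirically with a sketch like this (prime = 31 and the key range are arbitrary choices for the check):

```java
public class CollisionCheck {
    public static void main(String[] args) {
        int numOfBuckets = 8;
        long prime = 31; // arbitrary prime for the check

        // Every key of the form x + k * numOfBuckets should land in one
        // fixed bucket under BOTH schemes, i.e. the collision sets coincide.
        for (long x = 0; x < numOfBuckets; x++) {
            int plainBucket = (int) (x % numOfBuckets);
            int multBucket = (int) ((x * prime) % numOfBuckets);
            for (long k = 1; k <= 1000; k++) {
                long key = x + k * numOfBuckets;
                if ((int) (key % numOfBuckets) != plainBucket
                        || (int) ((key * prime) % numOfBuckets) != multBucket) {
                    throw new AssertionError("collision sets differ at key " + key);
                }
            }
        }
        System.out.println("Collision sets are identical for both schemes.");
    }
}
```

This follows directly from modular arithmetic: (x + k*n)*p mod n == x*p mod n, so multiplying by a prime only permutes the bucket labels without separating any colliding keys.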