16

I am reading the code of the HashMap class provided by the Java 1.6 API and am unable to fully understand the need for the following operation (found in the bodies of the put and get methods):

int hash = hash(key.hashCode());

where the method hash() has the following body:

    private static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

This effectively recalculates the hash by executing bit operations on the supplied hash code. I'm unable to understand the need to do so, even though the API documentation states the following:

This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.

I do understand that the key-value pairs are stored in an array of data structures, and that the index of an item in this array is determined by its hash. What I fail to understand is how this function would add any value to the hash distribution.
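For concreteness, here is a small sketch of what I take "collisions for hashCodes that do not differ in lower bits" to mean, assuming a 16-bucket table whose index is just the hash masked with length-1 (which is what a power-of-two length allows):

    public class LowerBitsCollision {

        // Index computation for a power-of-two table length:
        // only the low bits of h survive the mask.
        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int buckets = 16;
            // Three hash codes that differ only in bits above the low four.
            int[] codes = { 0x00000001, 0x00010001, 0x7FFF0001 };

            for (int h : codes) {
                // All three print bucket 1, i.e. they collide.
                System.out.printf("hashCode %08x -> bucket %d%n", h, indexFor(h, buckets));
            }
        }
    }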

VGDIV

4 Answers

25

As Helper wrote, it is there just in case the existing hash function for the key objects is faulty and does not do a good-enough job of mixing the lower bits. According to the source quoted by pgras,

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

The hash is being ANDed with length-1, where length is a power of two (so length-1 is guaranteed to be a block of 1s in the low bits). Due to this ANDing, only the lower bits of h are being used; the rest of h is ignored. Imagine that, for whatever reason, the original hash only returns numbers divisible by 2. If you used it directly, the odd-numbered positions of the hashmap would never be used, leading to a 2x increase in the number of collisions. In a truly pathological case, a bad hash function can make a hashmap behave more like a list than like an O(1) container.
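A minimal sketch of that failure mode, assuming a 16-bucket table and a hypothetical hashCode() that only ever returns even numbers:

    public class EvenHashDemo {

        // Same masking that HashMap's indexFor does for a power-of-two length.
        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int buckets = 16;
            boolean[] used = new boolean[buckets];

            // Pretend every key's hashCode() is divisible by 2.
            for (int h = 0; h < 1000; h += 2) {
                used[indexFor(h, buckets)] = true;
            }

            for (int i = 0; i < buckets; i++) {
                // Every odd-numbered bucket prints "never used":
                // half the table is wasted, doubling collisions in the rest.
                System.out.println("bucket " + i + ": " + (used[i] ? "used" : "never used"));
            }
        }
    }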

Sun engineers must have run tests that show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits. Under these circumstances, the bit operations in HashMap's hash(int h) can provide a net improvement over most expected use-cases (due to lower collision rates), even though extra computation is required.
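One rough way to see that improvement (just a bucket-occupancy count, not a benchmark): take 4096 keys whose hash codes have their low 12 bits all zero and drop them into a 256-bucket table, with and without the supplemental hash. The key count and table size here are just illustrative choices; hash() and indexFor() are the JDK 1.6 methods quoted above.

    import java.util.HashSet;
    import java.util.Set;

    public class SpreadDemo {

        // Supplemental hash from java.util.HashMap (JDK 1.6).
        static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        // Index computation from java.util.HashMap (JDK 1.6).
        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int buckets = 256;
            Set<Integer> rawBuckets = new HashSet<Integer>();
            Set<Integer> spreadBuckets = new HashSet<Integer>();

            for (int i = 0; i < 4096; i++) {
                int h = i << 12;                       // hash code with empty low 12 bits
                rawBuckets.add(indexFor(h, buckets));
                spreadBuckets.add(indexFor(hash(h), buckets));
            }

            // Without hash() every key lands in bucket 0; with it the keys
            // spread across many different buckets.
            System.out.println("buckets used without hash(): " + rawBuckets.size());
            System.out.println("buckets used with hash():    " + spreadBuckets.size());
        }
    }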

tucuxi
  • "just in case"? Actually, most hash codes in Java are going to be crappy. Just look at java.lang.Integer, for instance! But this actually makes sense. It's better to say "it's okay if everyone's Object.hashCode()s have crappy bit distribution, as long as they follow the equal-objects-have-equal-hashcodes rule, and try to avoid collisions as much as possible." Then only collection implementations like HashMap have the burden of passing those values through a secondary hash function, instead of it being everyone's problem. – Kevin Bourrillion Mar 30 '10 at 00:05
  • 'the odd-numbered positions of the hashmap would never be used' I don't understand it. Can you give an example? – Dean Chen Mar 14 '12 at 15:26
  • Ok, imagine I am hashing Employee objects, and all my Employees have an int ID field such as "400114", "400214", "400314", and so on (they all share the "14" part of their IDs because that is my department's suffix). The hashCode() method of Integer returns the integer itself -- so if I were to use employee-IDs as keys in a HashSet /without/ HashMap's hash(int h), the spread would be very, very uneven. In this example, since 14 is even, only even buckets would ever be used. – tucuxi Mar 30 '12 at 14:00
  • @tucuxi So can I think of `hash(int h)` as a secondary hash for an even distribution? – roottraveller Jun 23 '17 at 07:34
2

I read somewhere that this is done to ensure a good distribution even if your hashCode implementation, well, err, sucks.

helpermethod
  • Right, and the default hashcode() implementation in java.lang.Object doesn't have much distribution between hashes. – Sam Barnum Mar 29 '10 at 13:48
  • What I don't understand is that if each hash is unique (and the method in question does not - and cannot - address the problem of unique hashes), what problems does the mechanism face? It mentions something about collisions in lower-order bits - but that's not very clear. – VGDIV Mar 29 '10 at 13:57
  • Each hash is by definition not unique... I cannot give a good answer to your question, but the problem is in the "indexFor" method that returns "hashCode & (length-1)"... – pgras Mar 29 '10 at 14:08
2

As you know, the underlying implementation of HashMap is a hash table, specifically a chained (closed-addressing) hash table. The load factor is the maximum ratio of objects in the collection to the total number of buckets before the table is resized.

Let's say you keep adding more elements. Each time you do so, and it's not an update of an existing key, the map runs the key's hashCode method and reduces the result modulo the number of buckets (via the bit mask shown above, since the table length is a power of two) to decide which bucket the entry should go in.

As n (the number of elements in the collection) / m (the number of buckets) gets larger, your performance for reads and writes gets worse and worse.

Even assuming your hashCode algorithm is excellent, performance is still contingent upon the ratio n/m.

Rehashing is also used to increase the number of buckets, so that the map stays within the load factor with which it was constructed.

Remember, the main benefit of any hash implementation is the ideal O(1) performance for reads and writes.
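A rough simulation of the growth rule described above, assuming HashMap's documented defaults (initial capacity 16, load factor 0.75, capacity doubling on resize). The numbers are computed here rather than read out of a real HashMap, and the exact insertion that triggers each resize may differ by one from the real implementation:

    public class LoadFactorDemo {
        public static void main(String[] args) {
            int capacity = 16;                              // default initial table size
            float loadFactor = 0.75f;                       // default load factor
            int threshold = (int) (capacity * loadFactor);  // resize once size passes this

            // Pretend we insert 100 distinct keys, one at a time.
            for (int size = 1; size <= 100; size++) {
                if (size > threshold) {
                    capacity *= 2;
                    threshold = (int) (capacity * loadFactor);
                    System.out.println("after " + size + " entries: grow to " + capacity
                            + " buckets (new threshold " + threshold + ")");
                }
            }
        }
    }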

Jeff
1

As you know, Object.hashCode() can be overridden by users, so a really bad implementation can produce non-random lower-order bits. That would tend to crowd some buckets and leave many buckets unfilled.

I worked through what they are trying to do in hash at the bit level. The hash(int h) method isn't producing a random number; it mixes the bits of the supplied hash code so that the resulting values (and hence the buckets they map to) are distributed more uniformly.

Writing h[i] for bit i of the original hash code (bit 0 being the least significant, and bits above 31 counting as zero), the two XOR stages expand so that each bit of the result is the XOR of nine bits of the input:

        result[i] = h[i] ^ h[i+4] ^ h[i+7] ^ h[i+12] ^ h[i+16]
                         ^ h[i+19] ^ h[i+20] ^ h[i+24] ^ h[i+27]

As you can see, every bit of the result mixes in bits from positions far away from itself, so hash codes that differ only in their upper bits no longer crowd into the same few buckets. Hope this helps.
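A quick way to check that expansion against the real hash(int h) for a few arbitrary sample values (using the same bit numbering as above, with bits above 31 counting as zero):

    public class BitExpansionCheck {

        // Supplemental hash from java.util.HashMap (JDK 1.6).
        static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        // Bit i of x, treating bits above 31 as zero.
        static int bit(int x, int i) {
            return (i > 31) ? 0 : (x >>> i) & 1;
        }

        public static void main(String[] args) {
            int[] offsets = { 0, 4, 7, 12, 16, 19, 20, 24, 27 };
            int[] samples = { 1, 42, 0xCAFEBABE, 0x7FFF0001, -123456789 };

            for (int h : samples) {
                int expected = hash(h);
                for (int i = 0; i < 32; i++) {
                    int xor = 0;
                    for (int off : offsets) {
                        xor ^= bit(h, i + off);
                    }
                    if (xor != bit(expected, i)) {
                        System.out.println("mismatch at bit " + i + " for h = " + h);
                    }
                }
            }
            System.out.println("done (no mismatch lines means the expansion holds)");
        }
    }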

Vikas