I've been musing about this for some time: how exactly is Object.GetHashCode implemented in the CLR or Java? The contract for this method is that if it is called on the same object instance, it should always return the same value.

Note that I'm talking about the default implementation of GetHashCode(). Derived classes are not required to override this method. If they choose not to, they in essence have reference semantics: equality means "pointer equality" by default when the objects are used in hash tables and the like. This means that the runtime somehow has to provide a hash code that stays constant for the object throughout its lifetime.

If the machine I'm running on is 32-bit, and if the object instance never moved in memory, one could theoretically return the address of the object, reinterpreted as an Int32. That would be nice, since all distinct objects have distinct addresses and therefore would have different hash codes.

However, this approach is flawed, amongst other things because:

  • If the garbage collector moves the object in memory, its address changes, and so would its hash code, in violation of the contract that the hash code stays the same for the lifetime of the object.

  • On a 64-bit system, the object's address is too wide to fit into Int32.

  • Because managed objects tend to be aligned to some power-of-two boundary, the lowest bits of the address will always be zero. This may cause bad distribution patterns when the hash codes are used to index into a hash table.
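The last two points can be seen in a small sketch. The helper names `naive_address_hash` and `bucket_for` and the `Node` type are hypothetical, introduced only for illustration; they are not part of any runtime.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: hash an object by its address, truncated to 32 bits.
inline std::uint32_t naive_address_hash(const void* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    // On a 64-bit target this cast silently drops the upper 32 bits.
    return static_cast<std::uint32_t>(addr);
}

// Bucket index for a hash table whose size is a power of two.
inline std::size_t bucket_for(std::uint32_t hash, std::size_t table_size) {
    return hash & (table_size - 1);
}

// An 8-byte-aligned object: the low 3 bits of its address are always zero.
struct alignas(8) Node {
    std::uint64_t payload;
};
```

With an 8-bucket table, every `Node` lands in bucket 0, because the only bits the table looks at are the alignment zeros.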

In .NET, a bare System.Object instance consists of a sync block index and a type handle and nothing more, so there seems to be no room to cache the hash code in the instance itself. Somehow the runtime is still able to provide a persistent hash code. How? And how do Java, Mono, and other runtimes do this?

John Källén

3 Answers

No, not the address; that can't work with a garbage collector that moves objects. Conceptually it is simple: the hash code can be any random number, as long as it is stored once it has been generated. It does get stored in the object, in the syncblk field. That field can hold more than one kind of object property; it is replaced by an index into an allocated syncblk if more than one such property needs to be stored.
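A minimal sketch of that idea, assuming a 32-bit header word with two tag bits. The bit layout and the names `Runtime`, `SideEntry`, `hash_of`, `lock`, and `inflate` are all hypothetical; this is not the CLR's actual layout or code, and the hash generator is just a placeholder counter.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout: the top 2 bits say what the low 30 bits mean.
constexpr std::uint32_t kKindMask    = 0xC0000000u;
constexpr std::uint32_t kKindEmpty   = 0x00000000u;
constexpr std::uint32_t kKindHash    = 0x40000000u;  // payload = hash code
constexpr std::uint32_t kKindLock    = 0x80000000u;  // payload = owning thread id
constexpr std::uint32_t kKindIndex   = 0xC0000000u;  // payload = side-table index
constexpr std::uint32_t kPayloadMask = 0x3FFFFFFFu;

struct SideEntry {
    std::uint32_t hash = 0;        // 0 means "no hash assigned yet"
    std::uint32_t lock_owner = 0;  // 0 means "unlocked"
};

class Runtime {
public:
    // Returns the object's hash, generating and storing it on first use.
    std::uint32_t hash_of(std::uint32_t& header) {
        const std::uint32_t kind = header & kKindMask;
        if (kind == kKindHash) return header & kPayloadMask;
        if (kind == kKindEmpty) {
            const std::uint32_t h = fresh_hash();
            header = kKindHash | h;  // the hash fits in the header by itself
            return h;
        }
        SideEntry& entry = inflate(header);
        if (entry.hash == 0) entry.hash = fresh_hash();
        return entry.hash;
    }

    // Records a lock owner; moves to the side table if the header is in use.
    void lock(std::uint32_t& header, std::uint32_t thread_id) {
        if ((header & kKindMask) == kKindEmpty) {
            header = kKindLock | (thread_id & kPayloadMask);
            return;
        }
        inflate(header).lock_owner = thread_id;
    }

private:
    // Moves whatever the header holds into a side-table entry and
    // replaces the header contents with that entry's index.
    SideEntry& inflate(std::uint32_t& header) {
        const std::uint32_t kind = header & kKindMask;
        const std::uint32_t payload = header & kPayloadMask;
        if (kind == kKindIndex) return side_table_[payload];
        SideEntry entry;
        if (kind == kKindHash) entry.hash = payload;
        if (kind == kKindLock) entry.lock_owner = payload;
        side_table_.push_back(entry);
        header = kKindIndex | static_cast<std::uint32_t>(side_table_.size() - 1);
        return side_table_.back();
    }

    // Placeholder generator (an assumption): a counter that skips 0.
    std::uint32_t fresh_hash() {
        counter_ = (counter_ + 1) & kPayloadMask;
        if (counter_ == 0) counter_ = 1;
        return counter_;
    }

    std::vector<SideEntry> side_table_;
    std::uint32_t counter_ = 0;
};
```

The point of the design is that an object which is only ever hashed, or only ever locked, pays nothing beyond the header word; the side table is allocated only when both properties are needed at once.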

The .NET algorithm uses the managed thread ID so that threads are not likely to generate the same sequence:

inline DWORD GetNewHashCode()
{
    // Every thread has its own generator for hash codes so that we won't get into a situation
    // where two threads consistently give out the same hash codes.        
    // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A)
    DWORD multiplier = m_ThreadId*4 + 5;
    m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1;
    return m_dwHashCodeSeed;
}

The seed is stored per-thread so no lock is required. At least that's what is used in the SSCLI20 version. No idea about Java, I imagine it is similar.
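For experimenting with the recurrence outside the runtime, here is a standalone sketch. `HashCodeGenerator` is a hypothetical stand-in for the per-thread state, and `period_mod_2_16` is an assumption of mine: it runs the same recurrence at 16 bits so that the full-period claim in the comment can be checked by brute force.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-thread state used by GetNewHashCode().
struct HashCodeGenerator {
    std::uint32_t thread_id;
    std::uint32_t seed;

    std::uint32_t next() {
        // multiplier - 1 is a multiple of 4 and the increment is odd.
        const std::uint32_t multiplier = thread_id * 4u + 5u;
        seed = seed * multiplier + 1u;  // unsigned arithmetic wraps modulo 2^32
        return seed;
    }
};

// Same recurrence reduced to 16 bits, small enough to count the period.
// Returns the number of steps until the state first returns to 0.
inline std::uint32_t period_mod_2_16(std::uint16_t thread_id) {
    const std::uint32_t multiplier = thread_id * 4u + 5u;
    std::uint16_t x = 0;
    std::uint32_t steps = 0;
    do {
        // Widen before multiplying to avoid signed overflow after promotion.
        x = static_cast<std::uint16_t>(static_cast<std::uint32_t>(x) * multiplier + 1u);
        ++steps;
    } while (x != 0);
    return steps;
}
```

At 16 bits the state visits all 65536 values before repeating, whichever thread ID is chosen. Note that two generators starting from the same seed of 0 both return 1 first; their sequences only diverge from the second value onward.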

Hans Passant
  • Thank you for your answer, that makes a lot of sense. My thinking was locked on the idea that the syncblock could only store a single thing, but the mechanism you sketch here explains how multiple extra properties could be added to an object on a by-need basis. – John Källén Apr 07 '11 at 14:07

As a JVM implementer, I can say that the base hashcode IS typically related to the address of the object. It's typically not exactly the address, but some reasonable mangling of it. We do magic to ensure the hashCode is stable through the life of the object (even across GC, even if the object moves, etc.).

I strongly recommend implementing a good type-specific hashCode() for all objects you're going to be hashing. That Object implements it doesn't mean it's ideal for your use.

Trent Gray-Donald
  • Magically. Seriously, I don't mean to be evasive, but it's sometimes a source of competitive advantage. What I can say is that some versions of our JVM have bits reserved in the "hash and flags" portion of our header. That's typically not a full 32 bits for hash, so some duplication is used (and a resulting loss in potential entropy). Other options include remembering whether an object has been hashed, and thus remembering the "original" hash if the object needs to be moved. It might be saved in some sort of out of line data structure, or potentially in another part of the object. – Trent Gray-Donald Apr 09 '11 at 19:29
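The "remember whether the object has been hashed" option could look roughly like the sketch below. Everything in it is an assumption for illustration: the three-state flag, the names `ObjectHeader`, `identity_hash`, and `relocate`, and the address mangling are hypothetical and do not describe any particular JVM.

```cpp
#include <cstdint>

enum class HashState : std::uint8_t { NotHashed, Hashed, HashedAndMoved };

// Hypothetical object header with room for a saved hash.
struct ObjectHeader {
    HashState state = HashState::NotHashed;
    std::uint32_t saved_hash = 0;  // only meaningful in HashedAndMoved
};

// Assumed mangling: drop the alignment zeros, then fold 64 bits into 32.
inline std::uint32_t mangle_address(const void* p) {
    std::uint64_t a = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
    a >>= 3;
    return static_cast<std::uint32_t>(a ^ (a >> 32));
}

// While the object has not moved, the hash is derived from its address.
inline std::uint32_t identity_hash(ObjectHeader& obj) {
    if (obj.state == HashState::HashedAndMoved) return obj.saved_hash;
    obj.state = HashState::Hashed;
    return mangle_address(&obj);
}

// What a moving collector would do when copying `from` to `to`:
// if the hash was ever observed, carry the original value along.
inline void relocate(const ObjectHeader& from, ObjectHeader& to) {
    if (from.state == HashState::NotHashed) {
        to.state = HashState::NotHashed;
        return;
    }
    to.saved_hash = (from.state == HashState::Hashed) ? mangle_address(&from)
                                                      : from.saved_hash;
    to.state = HashState::HashedAndMoved;
}
```

Objects that are never hashed cost nothing extra when moved; only objects whose hash has been observed need the saved value.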

I'm not sure what you mean by "how exactly is Object.GetHashCode implemented in the CLR or Java?". Java's "public int hashCode()" has the contract that the author of a class should define the hashCode() implementation for it. In other words, it could vary widely between classes. I suspect this would be true for .NET platforms as well.

The Javadoc for Object describes an approach similar to your idea: http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Object.html#hashCode()

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)

This approach is not appropriate if you have defined equality for your class to be based on something other than identity.

Jolta
  • You're not required to implement GetHashCode when you derive from the Object class. By not doing so, you've implemented reference semantics. The JavaDoc above implies that "typically", the hash code for Objects (and any derived classes not overriding its implementation of GetHashCode) will return the address of the object. If the GC moves your object, you will now have a different hashcode for the same object than you did before GC. This won't play well with Hashtables, I'm guessing. – John Källén Apr 07 '11 at 13:51
  • You're right, I'm not required to, that's why I put "should" in the second sentence, not "must". =) I still think it's a good practice, like Trent Gray-Donald stated in his answer. – Jolta Apr 12 '11 at 08:37