Sorry to combine two questions into one, but they are related.

This is about hash codes for HashSets and the like. As I understand it, they must be unique, not change, and represent any configuration of an object as a single number.

My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16)?

Also GetHashCode appears to inflexibly return specifically the type Int32.

32bits can just about store, for example, two Int16s, a single unicode character or 16 N, S, E, W compass directions, it's not much, even something like a small few node graph would probably be too much for it. Does this represent a limit of C# Hash collections?

alan2here

2 Answers

As I understand it, they must be unique

Nope. They can't possibly be unique for most types, which can have more than 2^32 possible values. Ideally, if two objects have the same hash code then they're likely to be equal - but you should never assume that they are equal. The important point is that if they have different hash codes, they must be unequal.

My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16).

If it only contains two Int16 values, it would be simplest to use:

return (a << 16) | (ushort) b;

Then the value will be unique. Hoorah!
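
As a rough sketch of how that line might sit inside a complete type: the `Pair` struct and `PairDemo` class below are hypothetical names, not from the question, and the `unchecked` wrapper is an added precaution in case the project is compiled with overflow checking enabled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value type holding the two Int16 fields from the question.
public struct Pair : IEquatable<Pair>
{
    public readonly short A;
    public readonly short B;

    public Pair(short a, short b) { A = a; B = b; }

    public bool Equals(Pair other) { return A == other.A && B == other.B; }

    public override bool Equals(object obj) { return obj is Pair && Equals((Pair)obj); }

    // A goes in the high 16 bits, B in the low 16 bits.
    // The (ushort) cast stops a negative B from sign-extending over A's bits.
    public override int GetHashCode() { return unchecked((A << 16) | (ushort)B); }
}

public static class PairDemo
{
    public static void Main()
    {
        var set = new HashSet<Pair> { new Pair(1, -1), new Pair(2, -1) };

        Console.WriteLine(set.Contains(new Pair(1, -1)));   // True
        Console.WriteLine(new Pair(1, -1).GetHashCode()
                          == new Pair(2, -1).GetHashCode()); // False
    }
}
```

Equals still has to be overridden alongside GetHashCode; the collections call both.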

Also GetHashCode appears to inflexibly return specifically the type Int32.

Yes. Types such as Dictionary and HashSet need to be able to use the fixed size so they can work with it to put values into buckets.
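
To illustrate why a fixed-size value is convenient, here is a simplified sketch of how a hash code could be reduced to a bucket index. It is an illustration only, not the actual .NET implementation; `BucketSketch`, `BucketIndex` and the bucket count of 7 are made up for the example.

```csharp
using System;

public static class BucketSketch
{
    // Clear the sign bit so the value is non-negative,
    // then reduce it modulo the number of buckets.
    public static int BucketIndex(int hashCode, int bucketCount)
    {
        return (hashCode & 0x7FFFFFFF) % bucketCount;
    }

    public static void Main()
    {
        Console.WriteLine(BucketIndex(131071, 7)); // 3
        Console.WriteLine(BucketIndex(-1, 7));     // 1
    }
}
```

Many different hash codes land in the same bucket, which is one more reason the collection still has to call Equals on the candidates it finds there.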

32 bits can just about store, for example, two Int16s, a single Unicode character or 16 N, S, E, W compass directions. That's not much; even something like a small graph with a few nodes would probably be too much for it. Does this represent a limit of C# hash collections?

If it were a limitation, it would be a .NET limitation rather than a C# limitation - but no, it's just a misunderstanding of what hash codes are meant to represent.

Eric Lippert has an excellent (obviously) blog post about GetHashCode which you should read for more information.

Jon Skeet
  • Thanks for a clear answer. They don't have to be unique? So it could "return 0;" and it would still work, if inefficiently? – alan2here Apr 14 '12 at 18:16
  • @alan2here: Yes. Returning a constant is always a valid strategy for a hash code - but one which defeats all the normal efficiencies associated with hash tables etc. – Jon Skeet Apr 14 '12 at 18:17
  • ", if two objects have the same hash code then they're unlikely to be equal - but you should never assume that they are unequal" You messed up this sentence a bit – CodesInChaos Apr 14 '12 at 18:18
  • So the hash code is a sort of numerical estimate for similarity. Also I'm getting "Bitwise-or operator used on a sign-extended operand; consider casting to a smaller unsigned type first" warning on "return (a << 16) | b;", I'm ignoring it for now. – alan2here Apr 14 '12 at 18:24
  • @alan2here: No, not an estimate of similarity - two very similar values could have very different hash codes. *All* that the value represents is potential equality. And yes, you can ignore the warning - which I assume is from ReSharper? – Jon Skeet Apr 14 '12 at 18:25
  • I have no Visual Studio mods. Potential equality, ty :¬) – alan2here Apr 14 '12 at 18:27
  • Jon, as Hans Passant reminded Marc Gravell in comments on the question, @alan2here *should not* ignore that warning. If he does, for any negative `b` value, all `a, b` pairs will have the same hash code, regardless of the value of `a`. If `b` is less than zero, `((a << 16) | b) == b`, because of sign extension. – phoog Apr 14 '12 at 19:03
  • Thanks phoog. I've added the bracketed UInt16. – alan2here Apr 16 '12 at 20:21
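
The sign-extension problem raised in these comments can be reproduced with a small sketch. `SignExtensionDemo`, `WithoutCast` and `WithCast` are hypothetical names; the first method is expected to trigger the compiler warning quoted above, and the cast assumes the default unchecked context.

```csharp
using System;

public static class SignExtensionDemo
{
    // Version without the cast: a negative b is sign-extended to 32 bits,
    // so its upper 16 bits are all ones and overwrite whatever a contributed.
    public static int WithoutCast(short a, short b) { return (a << 16) | b; }

    // Version with the cast: b is zero-extended, so a's bits survive.
    public static int WithCast(short a, short b) { return (a << 16) | (ushort)b; }

    public static void Main()
    {
        Console.WriteLine(WithoutCast(1, -5));                       // -5
        Console.WriteLine(WithoutCast(1, -5) == WithoutCast(2, -5)); // True
        Console.WriteLine(WithCast(1, -5) == WithCast(2, -5));       // False
    }
}
```

Both versions are still valid hash codes; the uncast one just collides far more often than necessary.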

GetHashCode is not (and cannot be) unique for every instance of an object. Take Int64, for example; even if the hash function is perfectly distributed, there will be about four billion (2^32) Int64 values that hash to each hash code, since the hash code is, as you mentioned, only an Int32.

However, this is not a limitation on collections using hash codes; they simply use buckets for elements which hash to the same value. So a lookup into a hash table isn't guaranteed to be a single operation. Getting the correct bucket is a single operation, but there may be multiple items in that bucket.
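
A minimal sketch of that behaviour, using a deliberately bad hash code so that every element collides. `CollidingKey` and `CollisionDemo` are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type whose hash code is constant, so all instances collide.
public sealed class CollidingKey : IEquatable<CollidingKey>
{
    public readonly int Value;

    public CollidingKey(int value) { Value = value; }

    public bool Equals(CollidingKey other) { return other != null && Value == other.Value; }

    public override bool Equals(object obj) { return Equals(obj as CollidingKey); }

    // Valid but inefficient: every lookup has to fall back on Equals.
    public override int GetHashCode() { return 0; }
}

public static class CollisionDemo
{
    public static void Main()
    {
        var set = new HashSet<CollidingKey>();
        for (int i = 0; i < 100; i++) set.Add(new CollidingKey(i));
        set.Add(new CollidingKey(42)); // duplicate, rejected via Equals

        Console.WriteLine(set.Count);                           // 100
        Console.WriteLine(set.Contains(new CollidingKey(42)));  // True
        Console.WriteLine(set.Contains(new CollidingKey(100))); // False
    }
}
```

The set stays correct because Equals decides membership; the hash code only narrows down where to look.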

goric
  • If the hash function is well distributed then of course there will be *four billion* collisions per 32 bit hash. – Eric Lippert Apr 14 '12 at 18:54
  • Not 2 `Int64`s per `Int32` hash value, but 4294967296. 2^64 / 2^32 equals 2^32. – phoog Apr 14 '12 at 19:06
  • @EricLippert, phoog: of course you're correct. Something in my brain saw 32 and 64 and automatically went to 64/32 rather than 2^64/2^32... – goric Apr 15 '12 at 00:45