I am currently working on choosing a couple of general-purpose hashing functions for use in Object.GetHashCode()
overrides. Initially, on the recommendation of this site, I started with ELF. My C# implementation is below:
public int Generate(byte[] key) {
    const uint c = 0xf0000000;
    uint h = 0,
         g = 0;
    unchecked {
        for (int i = 0, len = key.Length; i < len; i++) {
            // Shift the accumulator up a nibble and mix in the next byte.
            h = (h << 4) + key[i];
            // If any of the top four bits are set, fold them back into bits 4-7...
            if ((g = h & c) != 0)
                h ^= g >> 24;
            // ...and clear those top bits so they do not accumulate.
            h &= ~g;
        }
    }
    return unchecked((int)h);
}
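For context, these functions get called from my GetHashCode() overrides roughly like this (the Person type below is purely illustrative, not my real code, and it assumes Generate is reachable from the class, e.g. via a static helper or base class):

public class Person {
    public string Name { get; set; }
    public int Age { get; set; }

    public override int GetHashCode() {
        // Serialise the identity-defining fields to a byte[] and hash that.
        byte[] key = System.Text.Encoding.UTF8.GetBytes(Name + "\0" + Age);
        return Generate(key); // the Generate method above, assumed accessible here
    }

    public override bool Equals(object obj) {
        var other = obj as Person;
        return other != null && other.Name == Name && other.Age == Age;
    }
}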
My test case consists of 524,288 unique values, split evenly (131,072 each) into four sets: short (1-64) and long (256-2048) strings drawn from a limited ASCII character set, and short and long arbitrary binary data, to test each algorithm under a variety of circumstances.
I also understand the limitations of this test scenario. A hashing algorithm may perform exceptionally well when hashing, say, URLs, but be awful at hashing JPGs or something. Random strings/binary seems to me to be the best starting point for choosing a general purpose function though. I am happy to hear reasons for why this is not the case.
I performed 3 separate test runs (generating a new set of random strings/bytes each time) and averaged the results.
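For what it's worth, the collision count for each category is measured along these lines (a simplified sketch, not my exact harness; the fixed seed and Base64 key-deduplication are illustrative choices):

// Requires: using System; using System.Collections.Generic;
int CountCollisions(Func<byte[], int> hash, int count, int minLen, int maxLen) {
    var rng = new Random(12345);       // fixed seed only so the sketch is repeatable
    var keys = new HashSet<string>();  // ensures the inputs themselves are unique
    var hashes = new HashSet<int>();   // hash values seen so far
    int collisions = 0;

    while (keys.Count < count) {
        var key = new byte[rng.Next(minLen, maxLen + 1)];
        rng.NextBytes(key);
        if (!keys.Add(Convert.ToBase64String(key)))
            continue;     // duplicate input, try again
        if (!hashes.Add(hash(key)))
            collisions++; // distinct input, but its hash was already produced
    }
    return collisions;
}

The short-binary figure, for example, comes from something like CountCollisions(Generate, 131072, 1, 64); the string categories build random ASCII strings and encode them to bytes instead.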
The ELF algorithm produced a horrific number of collisions in comparison to the other algorithms I'm testing:
- Short strings: 817 collisions (~0.5% fail rate).
- Short binary: 550 collisions (~0.4% fail rate).
- Long strings/binary: 34 collisions (~0.025% fail rate).
To place this in context, the other 3 algorithms I tested produced between 3 and 10 collisions on average for the same tests. ELF is also amongst the slowest of the four, so at this point it appears to be entirely useless.
Full results:
           Strings      Binary
Algorithm  short:long   short:long
ELF        817:40       550:28
FNV        1.6:2        0.6:2.6
OAT        9:9.6        14:5
Jenkins*   2:1.3        12:3.6

* A close approximation of the lookup3 hash function.
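For reference, the FNV row is a 32-bit FNV; in its FNV-1a form the function looks roughly like this (same byte[] interface as my ELF method above; a sketch for comparison rather than my exact code):

public int GenerateFnv1a(byte[] key) {
    const uint offsetBasis = 2166136261;
    const uint prime = 16777619;
    uint h = offsetBasis;
    unchecked {
        for (int i = 0, len = key.Length; i < len; i++) {
            h ^= key[i];  // mix the next byte into the low bits
            h *= prime;   // multiply to spread it across the whole word
        }
        return (int)h;
    }
}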
So for the same random samples that ELF is struggling on (I have generated 3 separate sets), all the other algorithms I tested are producing far fewer collisions.
I have searched for variants of the ELF algorithm, but the few examples I have found are consistent with what I have implemented. The only variation I have seen was in this SO question: Using ELF to produce a tweaked hashmap. That variation moves h &= ~g inside the if-block and clips the result to 31 bits. I tested that variation and it produced the same awful results.
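As I read that answer, the inner loop and return become the following (my reading of the variant, not a quote of its code):

unchecked {
    for (int i = 0, len = key.Length; i < len; i++) {
        h = (h << 4) + key[i];
        if ((g = h & c) != 0) {
            h ^= g >> 24;
            h &= ~g; // moved inside the if; a no-op when g == 0 anyway, so only the clip below differs
        }
    }
}
return (int)(h & 0x7fffffff); // clip to 31 bits so the result is never negative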
Have I done something subtly but horribly wrong? I can't understand why it's performing so badly given that it is allegedly widely used in Unix.