Why such a high collision rate with my ELF hash implementation

Question

I am currently working on choosing a couple of general-purpose hashing functions for use in Object.GetHashCode() overrides. Initially, on the recommendation of this site I started with ELF. My C# implementation is below:

public int Generate(byte[] key) {
    const uint c = 0xf0000000;

    uint h = 0,
         g = 0;

    unchecked {
        for (int i = 0, len = key.Length; i < len; i++) {
            h = (h << 4) + key[i];

            if ((g = h & c) != 0)
                h ^= g >> 24;

            h &= ~g;
        }
    }

    return (int)h;
}

My test case consists of 524,288 unique values divided into short (1-64) and long (256-2048) strings (limited ASCII character set) and arbitrary binary data (131,072 each) to test each algorithm under a variety of circumstances.

I also understand the limitations of this test scenario. A hashing algorithm may perform exceptionally well when hashing, say, URLs, but be awful at hashing JPGs or something. Random strings/binary seems to me to be the best starting point for choosing a general purpose function though. I am happy to hear reasons for why this is not the case.

I performed 3 separate test runs (generating a new set of random strings/bytes each time) and averaged the results.

The ELF algorithm produced a horrific number of collisions in comparison to the other algorithms I'm testing:

Short strings: 817 collisions (~0.5% fail rate).
Short binary: 550 collisions (~0.4% fail rate)
Long strings/binary: 34 collisions (~0.025% fail rate).

To place this in context, the other 3 algorithms I tested produced on average between 3-10 collisions on average for the same tests. It is also amongst the slowest of the 4, so at this point it appears to be entirely useless.

Full results:

           Strings     Binary
Algorithm  short:long  short:long
ELF        817:40      550:28
FNV        1.6:2       0.6:2.6
OAT        9:9.6       14:5
Jenkins*   2:1.3       12:3.6

* A close approximation of the lookup3 hash function.

So for the same random samples that ELF is struggling on (I have generated 3 separate sets), all other tested algorithms are producing way way fewer collisions.

I have searched for variants of the ELF algorithm, but the few examples I have found seem consistent with what I have implemented. The only variation I have seen was on this SO question: Using ELF to produce a tweaked hashmap. This variation includes h &= g >> 24 within the if-block, and clips the result to 31 bits. I tested that variation and it produced the same awful results.

Have I done something subtley but horribly wrong? I can't understand why it's performing so badly given that it is allegedly widely used in Unix.

I just believe that the designers of ELF symbol tables where less expert in hashing than you are... — Basile Starynkevitch, Feb 06 '12 at 12:09
What's the question - "Is my implementation of ELF correct?" — Ben, Feb 06 '12 at 12:20
"...alledgedly widely used in Unix." << So what? Used in Unix != Fast and Best. — Greg D, Feb 07 '12 at 23:03
@GregD - My point was that it is far more likely that my ELF implementation is flawed than the alternative, which is that the creators and maintainers of Unix chose a poor hash algorithm. Nobody said anything about it being the fastest or the best. — Quick Joe Smith, Feb 08 '12 at 01:34

Ben · Answer 1 · 2012-02-08T14:05:33.280

It is not a cryptographic hash, it is a hashtable hash.

This is a perfectly reasonable performance for a hash function intended for use in a hash table. Typically you will be storing between hundreds and hundreds of thousands of objects and will want to quickly store and retrieve the objects.

You do this by dividing into buckets, each containing a linked list (or maybe an array). You then calculate the hash, and taking the remainder when dividing by the number of buckets, you locate the bucket. Then you walk the linked list comparing each object to find the one you want.

If the bucket is empty the object is not found. You can then either create one or take the other appropriate action depending on your application.

The hashtable should be sized to have approximately the same number of buckets as the expected number of items to store (or a few more), so most searches will find a bucket with zero, one or two entries.

For performance you want to balance the expense of calculating the hash against the expense of traversing a very short linked list if you get a collision. It is with this in mind that the implementation of ELF and similarly purposed hash functions are designed.

In short:

In a hashtable, the occasional collision is a price worth paying for a faster hash.
In a cryptographic hash, a slow hash is a price worth paying for avoiding collisions.

If collisions are a problem in your application, use SHA1 or SHA256 or something designed with that in mind.

Note: For your use as an implementation of object.GetHashCode() the hash code is only intended to speed comparisons ("fail fast") and for use in hash tables. You don't need it to be completely collision resistant since you are going to fall back to a full equality comparision if it collides. You need balanced performance. I suggest just hashing the most important fields (using their own GetHashCode()) and XORing the values.

Edit: See also these hashes here:

http://www.burtleburtle.net/bob/hash/doobs.html

I was skeptical of XORing because of how easily it can produce collisions, so I rolled a quick & dirty XorHash class and it beat my ELF implementation hands down. My intuitions are not serving me very well at all. — Quick Joe Smith, Feb 07 '12 at 01:37
I've also updated the question to include the results of the other algorithms and emphasise that my question is about why my ELF implementation is producing so many collisions _relative to the other algorithms_. I apologise for not making that more clear upfront. — Quick Joe Smith, Feb 07 '12 at 02:25
Naively XORing introduces specific types of failures. It is up to you to understand if those failures apply to your code. The most well-known issue is that XOR is symmetric. — Brian, Feb 07 '12 at 15:04
@brian, agreed. If you have two objects which can be ("a","b") and ("b","a") then producing a hash by xoring the hashes of the fields will make those two objects collide. It really depends if that is a common case or not. For some objects it will be. — Ben, Feb 07 '12 at 16:38

Eric Lippert · Answer 2 · 2012-02-06T15:56:54.483

The expected number of collisions in 524000 random samples on a 32 bit hash is 34.

You're getting 34 collisions with long strings, so for long strings, this algorithm is performing more or less as expected.

Hash collisions are far, far more likely on short strings since there is so much less entropy in the data, so it is in no way suprising to me that you're getting orders of magnitude worse performance on small strings.

It is suprising that you are getting only ten collisions with other hash algorithms. I would have expected a lot more.

On the subject of raw speed performance: you might do better to stop being so clever. The jitter may recognize and optimize the extremely common pattern:

for(int i = 0; i < array.Length; ++i)
    do something with array[i]

so as to avoid the recomputation of Length and avoid the range check on the array access. By trying to be clever and avoid the recomputation of Length, you might be fooling the jitter into no longer optimizing away the range check.

If you wish to always avoid the range check, you can always go to unsafe code; fix the array in place, obtain a pointer to it, and then increment the pointer, like you're writing the program in C. You take responsibility for ensuring the memory safety at that point, but odds are good it will be faster.

Of course, that "armchair" performance analysis is worth exactly what you paid for it; to get a real analysis, try it and see what happens.

the ELF hash function appears to not have any avalanching phase. This is possibly why it performs poorly on small sets of inputs. — ShuggyCoUk, Feb 06 '12 at 17:59
or the OP's generation of input corpus is not randomly distributed so is inherently not likely to be not well distributed in the output without considerably more entropy through more characters of course — ShuggyCoUk, Feb 06 '12 at 18:01
I should have been more clear in my question (I always overlook some detail), but there are 131,072 samples in each category (short/long string/binary) for a total of ~500K. So for its worst category, it's 1 collision per 160 inputs. — Quick Joe Smith, Feb 06 '12 at 22:11
Eric Lippert - Also, thanks for the tip about the optimiser and array traversal! I will add the "store array length" trick to my list of discarded performance hacks along with "++i instead of i++". — Quick Joe Smith, Feb 08 '12 at 02:21

score 0 · Answer 3 · answered Feb 06 '12 at 12:38

0

The actual ELF implementation returns unsigned long, and the source uses unsigned long internally. I can't say for sure but my intuition is that your implementation is just throwing away too many interesting bits by dealing in int.

answered Feb 06 '12 at 12:38

Ernest Friedman-Hill

80,601
10
150
186

3

An `unsigned long` in C(++) is often equivalent to a `uint` in C#. (I say "often" and not "always" because it's implementation-defined.) And from a bit-twiddling perspective, `int` and `uint` are both just a sequence of 32 bits. I'd guess that no bits are being thrown away here. – LukeH Feb 06 '12 at 12:46
I see, so am I correct in saying that this would render it less than ideal in the .NET world of 32-bit hashcodes? – Quick Joe Smith Feb 06 '12 at 12:48
@Quick Joe Smith: (A guess, again) Judging from the age of the Elf algorithm I'd guess that it's supposed to be 32- rather than 64-bit, in which case your implementation looks ok. I suspect that Elf is just not as collision-resistant as you might expect. – LukeH Feb 06 '12 at 12:57
the OP should actually have unchecked around the final cast of h as well (given he uses it in the combining phase), that said no bits of state are lost assuming this is true – ShuggyCoUk Feb 06 '12 at 18:13

Why such a high collision rate with my ELF hash implementation

3 Answers3