I have seen a couple of questions asking "do two 16-bit hashes have the same collision rate as a single 32-bit hash?" or "do two 32-bit hashes have the same collision rate as a single 64-bit hash?" The answer seems to be "yes, provided they are decent hash functions that are not correlated". But what does that mean?
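To make that concrete, here is a toy sketch (my own illustration, not from any of those questions): if two 16-bit hashes are truly independent, a full collision requires both halves to collide at once, so the pair behaves like a uniform 32-bit hash (Pr ≈ 2^-16 · 2^-16 = 2^-32). The `toy16` function below is a made-up placeholder, not a real hash:

```javascript
// Toy illustration: two *independent* 16-bit hashes combined give ~32 bits
// of collision resistance, because a full collision requires BOTH halves
// to collide on the same pair of inputs.
// toy16 is a hypothetical stand-in, NOT a real or recommended hash.
function toy16(str, seed) {
  let h = seed >>> 0;
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x9e37) >>> 0;
    h ^= h >>> 7;
  }
  return h & 0xffff; // keep only 16 bits
}

// Two differently-seeded 16-bit results packed into one 32-bit value.
function combined32(str) {
  return ((toy16(str, 1) << 16) | toy16(str, 2)) >>> 0;
}
```

The pair only matches the 32-bit birthday bound if the two halves really are uncorrelated; if they tend to collide together, the combined width is illusory.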
The author of MurmurHash3, Austin Appleby, stated this:
MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant.
He advises against using MurmurHash2_x86_64, yet gives no such warning about MurmurHash3_x86_128, which appears to mix four 32-bit results to produce a 128-bit result.
And that function seems even worse: the h3 and h4 outputs will always collide if the message is under 8 bytes. h2 is also prone to collide, producing results like this 100% of the time:
seed = 0, dataArr = {0}
h1 = 2294590956, h2 = 1423049145
h3 = 1423049145, h4 = 1423049145

seed = 0, dataArr = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
h1 = 894685359, h2 = 2425853539
h3 = 2425853539, h4 = 2425853539

Another example: the hash of "bryc" is e87e2554db409442db409442db409442 (db409442 repeats 3 times).
Any input of null bytes with a length under 16 will produce those collisions, regardless of the seed.
Anyway, if what Appleby says about his function is true, namely that the collision resistance of two 32-bit results is no better than a single 32-bit result, why is it that every time I force a collision in one result, the other is unaffected, without fail? Collisions in just one of the two hashes are exponentially more common.
Collisions of h1 in MurmurHash2_x86_64...

[2228688450, 3117914388] !== [2228688450, 2877485180]
[957654412, 3367924496] !== [957654412, 762057742]
[1904489323, 1019367692] !== [1904489323, 1894970953]
[2752611220, 3095555557] !== [2752611220, 2609462765]
The reason I ask is that I want to implement a 64-bit (or wider) hash in JavaScript for decent error detection. 32-bit hash functions aren't good enough, and no currently available solution on GitHub is fast enough. Since JavaScript's bitwise operations work on 32-bit integers, only functions built on uint32_t arithmetic port cleanly to JS. And many 32-bit functions seem capable of producing a larger output without much performance loss.
I already implemented (in JavaScript) MurmurHash2_x86_64 and MurmurHash3_x86_128 and their performance is impressive. I also implemented MurmurHash2_160.
Do all of these have the same collision resistance as a 32-bit hash? How can you tell whether the results are correlated enough to be an issue? I want a 64-bit output to have the strength of a true 64-bit hash, a 160-bit output to be as strong as a 160-bit hash, and so on, all within the constraint of 32-bit arithmetic (a JavaScript limitation).
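One empirical way to probe correlation is a sketch like the following (my own, not a standard test): hash many distinct inputs and count collisions per 32-bit lane versus collisions of the combined pair. For independent lanes you expect roughly n²/2^33 collisions in each lane but essentially zero combined collisions; if the combined count tracks the lane counts, the lanes are correlated and the pair is no stronger than 32 bits. `pairHash` is any `(bytes) => [h1, h2]` function; `toyPair` below is only a placeholder so the sketch runs on its own.

```javascript
// Count pairwise collisions in lane 1, lane 2, and the combined 64-bit pair
// over n distinct 4-byte inputs.
function countCollisions(pairHash, n) {
  const lane1 = new Map(), lane2 = new Map(), both = new Map();
  let c1 = 0, c2 = 0, cBoth = 0;
  const buf = new Uint8Array(4);
  for (let i = 0; i < n; i++) {
    // Encode the counter as 4 distinct little-endian bytes.
    buf[0] = i & 255; buf[1] = (i >>> 8) & 255;
    buf[2] = (i >>> 16) & 255; buf[3] = (i >>> 24) & 255;
    const [h1, h2] = pairHash(buf);
    c1 += lane1.get(h1) || 0; lane1.set(h1, (lane1.get(h1) || 0) + 1);
    c2 += lane2.get(h2) || 0; lane2.set(h2, (lane2.get(h2) || 0) + 1);
    const k = h1 + ":" + h2;
    cBoth += both.get(k) || 0; both.set(k, (both.get(k) || 0) + 1);
  }
  return { lane1: c1, lane2: c2, combined: cBoth };
}

// Toy placeholder pair hash (NOT a real hash) so the sketch is runnable;
// substitute your own two-lane function here.
function toyPair(bytes) {
  let h1 = 0x9747b28c, h2 = 0x85ebca6b;
  for (const b of bytes) {
    h1 = (Math.imul(h1 ^ b, 0x5bd1e995) ^ (h1 >>> 13)) >>> 0;
    h2 = (Math.imul(h2 ^ b, 0xcc9e2d51) ^ (h2 >>> 15)) >>> 0;
  }
  return [h1, h2];
}
```

A combined collision can only occur on a pair of inputs that already collides in both lanes, so `combined <= min(lane1, lane2)` always holds; the interesting signal is whether `combined` stays near zero while the lanes accumulate their expected collisions, or whether it tracks them.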
Update: Here is my custom 64-bit hash, designed for speed (faster than my optimized 32-bit MurmurHash3 under Chrome/Firefox).
function cyb_beta3(key, seed = 0) {
    var m1 = 1540483507, m2 = 3432918353, m3 = 433494437, m4 = 370248451;
    // Initialize the two 32-bit states differently from the seed and length.
    var h1 = seed ^ Math.imul(key.length, m3) + 1;
    var h2 = seed ^ Math.imul(key.length, m1) + 1;
    // Process the key in 4-byte little-endian chunks.
    for (var k, i = 0, chunk = -4 & key.length; i < chunk; i += 4) {
        k = key[i+3] << 24 | key[i+2] << 16 | key[i+1] << 8 | key[i];
        k ^= k >>> 24;
        h1 = Math.imul(h1, m1) ^ k; h1 ^= h2;
        h2 = Math.imul(h2, m3) ^ k; h2 ^= h1;
    }
    // Mix in the remaining 1-3 tail bytes (cases fall through).
    switch (3 & key.length) {
        case 3: h1 ^= key[i+2] << 16, h2 ^= key[i+2] << 16;
        case 2: h1 ^= key[i+1] << 8, h2 ^= key[i+1] << 8;
        case 1: h1 ^= key[i], h2 ^= key[i];
                h1 = Math.imul(h1, m2), h2 = Math.imul(h2, m4);
    }
    // Finalization: cross-mix the two states.
    h1 ^= h2 >>> 18, h1 = Math.imul(h1, m2), h1 ^= h2 >>> 22;
    h2 ^= h1 >>> 15, h2 = Math.imul(h2, m3), h2 ^= h1 >>> 19;
    return [h1 >>> 0, h2 >>> 0];
}
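As a side note, the two lanes can be joined into a single 64-bit hex digest. This tiny helper (my own naming, not part of the hash itself) shows the idea, with literal values so it stands alone:

```javascript
// Render a [h1, h2] pair of unsigned 32-bit integers (as returned by
// cyb_beta3) as a 16-hex-digit string.
function pairToHex([h1, h2]) {
  return (h1 >>> 0).toString(16).padStart(8, "0") +
         (h2 >>> 0).toString(16).padStart(8, "0");
}

console.log(pairToHex([0xdeadbeef, 0x01020304])); // "deadbeef01020304"
```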
It's based on MurmurHash2. The two internal states h1 and h2 are initialized separately, but are mixed with the same chunk of the key. Each is then mixed with the alternate state (e.g. h1 ^= h2). They are mixed together again at the end as part of finalization.
Is there anything to suggest this is weaker than a true 64-bit hash? It passes my own basic avalanche/collision tests, but I am no expert.
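For what it's worth, this is the kind of avalanche check I mean, written generically so it can be run against any `(bytes) => [h1, h2]` function; `demoPair` is only a stand-in so the sketch runs on its own. A flip rate far from 0.5 on any output bit is a red flag:

```javascript
// Avalanche sketch: flip one random input bit and measure how often each
// of the 64 output bits changes. For a well-mixing hash every rate should
// hover near 0.5.
function avalancheRates(pairHash, inputLen, trials) {
  const flips = new Float64Array(64);
  const buf = new Uint8Array(inputLen);
  for (let t = 0; t < trials; t++) {
    for (let i = 0; i < inputLen; i++) buf[i] = (Math.random() * 256) | 0;
    const [a1, a2] = pairHash(buf);
    const bit = (Math.random() * inputLen * 8) | 0; // pick a random input bit
    buf[bit >> 3] ^= 1 << (bit & 7);                // flip it
    const [b1, b2] = pairHash(buf);
    const d1 = (a1 ^ b1) >>> 0, d2 = (a2 ^ b2) >>> 0;
    for (let j = 0; j < 32; j++) {
      flips[j] += (d1 >>> j) & 1;
      flips[32 + j] += (d2 >>> j) & 1;
    }
  }
  return Array.from(flips, f => f / trials); // per-output-bit flip rate
}

// Toy stand-in pair hash (NOT a real hash); swap in the real function here.
function demoPair(bytes) {
  let h1 = 0x9747b28c, h2 = 0x85ebca6b;
  for (const b of bytes) {
    h1 = (Math.imul(h1 ^ b, 0x5bd1e995) ^ (h1 >>> 13)) >>> 0;
    h2 = (Math.imul(h2 ^ b, 0xcc9e2d51) ^ (h2 >>> 15)) >>> 0;
  }
  return [h1, h2];
}
```

This only catches gross mixing failures, not subtle correlation between the lanes, which is why the collision-counting approach above is a useful complement.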