
I have seen a couple questions that ask "do two 16-bit hashes have the same collision rate as a 32-bit hash?" or "do two 32-bit hashes have the same collision rate as a 64-bit hash?" And it seems like the answer is "yes, if they're decent hash functions that are not correlated". But what does that mean?
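To put rough numbers on "the same collision rate", here is a minimal sketch that assumes ideal, uniformly distributed and fully independent hashes (real functions only approximate this). The helper names `expectedCollidingPairs` and `pairCollisionProbability` are made up for illustration. Under that assumption, a given pair of keys collides in both 16-bit hashes with probability 2^-16 × 2^-16 = 2^-32, the same as for one ideal 32-bit hash.

```javascript
// Expected number of colliding pairs among n keys for an ideal `bits`-bit hash
// (birthday approximation: C(n, 2) / 2^bits).
function expectedCollidingPairs(n, bits) {
    return (n * (n - 1) / 2) / Math.pow(2, bits);
}

// Chance that one specific pair of keys collides in BOTH of two hashes,
// ASSUMING the two hashes are independent of each other.
function pairCollisionProbability(bitsA, bitsB) {
    return Math.pow(2, -bitsA) * Math.pow(2, -bitsB);
}

console.log(expectedCollidingPairs(65536, 32));                     // about 0.5
console.log(pairCollisionProbability(16, 16) === Math.pow(2, -32)); // true
```

If the two hashes are correlated, the product rule no longer applies and the real collision probability is higher than this estimate.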

The author of MurmurHash3 stated this:

MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant.

He advises against using MurmurHash2_x86_64, yet mentions no such advisory about MurmurHash3_x86_128 which appears to mix four 32-bit results to produce a 128-bit result.

And that function even seems worse: The output of h3 and h4 will always collide if the message is under 8 bytes. h2 is also prone to collide, creating results like this 100% of the time:

seed = 0, dataArr = {0}
h1 = 2294590956, h2 = 1423049145, h3 = 1423049145, h4 = 1423049145

seed = 0, dataArr = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
h1 = 894685359, h2 = 2425853539, h3 = 2425853539, h4 = 2425853539

Another example: Hash of "bryc" - e87e2554db409442db409442db409442
db409442 repeats 3 times

Any combination of null bytes with a length under 16 will result in those collisions, regardless of the seed.

Anyway, if what Appleby says is true about his function, that the collision resistance of the two 32-bit results is no better than a single 32-bit result, why is it that every time I force a collision in one result, without fail, the other is unaffected? Collisions in just one hash are exponentially more common.

Collisions of h1 in MurmurHash2_x86_64...
[2228688450, 3117914388] !== [2228688450, 2877485180]
[957654412, 3367924496] !== [957654412, 762057742]
[1904489323, 1019367692] !== [1904489323, 1894970953]
[2752611220, 3095555557] !== [2752611220, 2609462765]

I ask because I want to implement a 64-bit (or greater) hash in JavaScript for decent error detection. 32-bit hash functions aren't good enough, and no currently available solution on GitHub is fast enough. Since JavaScript's bitwise operators work on 32-bit integers, only functions that use arithmetic on uint32_t are compatible with JS. Many 32-bit functions seem capable of producing a larger output without too much performance loss.

I already implemented (in JavaScript) MurmurHash2_x86_64 and MurmurHash3_x86_128 and their performance is impressive. I also implemented MurmurHash2_160.

Do all of these have the same collision resistance as a 32-bit hash? How can you tell if the results are correlated enough to be an issue? I want a 64-bit output to have the strength of a 64-bit hash, a 160-bit output as strong as a 160-bit hash etc. - while under the requirement of 32-bit arithmetic (JavaScript limitation).

Update: Here is my custom 64-bit hash, designed for speed (faster than my optimized 32-bit MurmurHash3 under Chrome/Firefox).

function cyb_beta3(key, seed = 0) {
    var m1 = 1540483507, m2 = 3432918353, m3 = 433494437, m4 = 370248451;
    var h1 = seed ^ Math.imul(key.length, m3) + 1;
    var h2 = seed ^ Math.imul(key.length, m1) + 1;

    for (var k, i = 0, chunk = -4 & key.length; i < chunk; i += 4) {
        k = key[i+3] << 24 | key[i+2] << 16 | key[i+1] << 8 | key[i];
        k ^= k >>> 24;
        h1 = Math.imul(h1, m1) ^ k; h1 ^= h2;
        h2 = Math.imul(h2, m3) ^ k; h2 ^= h1;
    }
    switch (3 & key.length) {
        case 3: h1 ^= key[i+2] << 16, h2 ^= key[i+2] << 16;
        case 2: h1 ^= key[i+1] << 8, h2 ^= key[i+1] << 8;
        case 1: h1 ^= key[i], h2 ^= key[i];
                h1 = Math.imul(h1, m2), h2 = Math.imul(h2, m4);
    }
    h1 ^= h2 >>> 18, h1 = Math.imul(h1, m2), h1 ^= h2 >>> 22;
    h2 ^= h1 >>> 15, h2 = Math.imul(h2, m3), h2 ^= h1 >>> 19;

    return [h1 >>> 0, h2 >>> 0];
}

It's based on MurmurHash2. Each internal state h1, h2 are initialized separately, but are mixed with the same chunk of the key. Then they are mixed with the alternate state (e.g. h1 ^= h2). They are mixed again at the end as part of finalization.

Is there anything to suggest this is weaker than a true 64-bit hash? It passes my own basic avalanche/collision tests correctly, but I am no expert.

bryc

1 Answer


The difference between MurmurHash2_x86_64 and MurmurHash3_x86_128 is that the former does only one [32-bit, 32-bit] -> 64-bit mix, at the end, while the latter does a 128-bit mix for every 16 bytes of input (not a full-fledged mix, but enough for this purpose).

So, logically, MurmurHash2_x86_64 splits the input into 2 totally separate streams, calculates a 32-bit hash for each of them, then mixes the two 32-bit results into a 64-bit one. So this is not a true 64-bit hash. For example, if one stream is damaged but happens to retain the same hash value, the damage won't be noticed. This event has approximately the same probability as it would with a 32-bit hash in the first place. So this hash has less than 64-bit strength.
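To make that failure mode concrete, here is a toy sketch with two 8-bit lanes instead of two 32-bit ones. Everything in it (`laneHash`, `twoLaneHash`, `findLaneCollision`, the constants) is made up for illustration and is not MurmurHash code. It only mirrors the structure described above: two lanes that never see each other's input, combined once at the end.

```javascript
// Toy 8-bit lane hash. The constants are arbitrary illustration values,
// not taken from any MurmurHash variant.
function laneHash(bytes) {
    let h = 0x9d;
    for (const b of bytes) {
        h = (Math.imul(h ^ b, 0x2d) + 1) & 0xff;
        h ^= h >>> 3;
    }
    return h;
}

// Even-indexed bytes feed lane A, odd-indexed bytes feed lane B.
// The two lane results are combined only at the very end.
function twoLaneHash(bytes) {
    const a = laneHash(bytes.filter((_, i) => i % 2 === 0));
    const b = laneHash(bytes.filter((_, i) => i % 2 === 1));
    // Final mix: both output bytes depend on both lane results.
    const hi = (a ^ (b >>> 2)) & 0xff;
    const lo = (b ^ Math.imul(hi, 0x65)) & 0xff;
    return (hi << 8) | lo;
}

// Vary only the lane-A bytes until two messages share a lane-A result.
// 257 candidates into 256 possible results: the pigeonhole principle
// guarantees a hit before the loop ends.
function findLaneCollision() {
    const seen = new Map();
    for (let x = 0; x <= 256; x++) {
        const msg = [x & 0xff, 7, x >>> 8, 9]; // bytes 7 and 9 go to lane B
        const a = laneHash([msg[0], msg[2]]);
        if (seen.has(a)) return [seen.get(a), msg];
        seen.set(a, msg);
    }
    return null; // not reachable, see the pigeonhole argument above
}

const [msgA, msgB] = findLaneCollision();
// The messages differ, yet the last value printed is always true.
console.log(msgA, msgB, twoLaneHash(msgA) === twoLaneHash(msgB));
```

The search needs at most 257 tries, which is 8-bit effort against a 16-bit output. No final mix can repair this, because the final mix only ever sees the two lane results.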

On the other hand, MurmurHash3_x86_128 internally has a 128-bit state, which is mixed every 16 input bytes (i.e., each 16-byte block of input affects the whole internal state almost immediately, not just at the end), so this is a true 128-bit hash.

geza
  • Isn't `MurmurHash3_x86_128` a 128-bit hash, and `MurmurHash2_160` a 160-bit hash? So from what you're saying, I see that `MurmurHash2_x86_64` doesn't mix the internal states `h1` and `h2` until the end. If it were to do so within the input mixing phase similar to `MurmurHash3_x86_128` by addition or XOR of the alternate state, would it then have the collision properties of a proper 64-bit hash? To that effect, I made a [slight modification (see diff)](http://www.mergely.com/oUkOyG3A/) to more closely match the others using XOR instead of addition. – bryc Apr 04 '18 at 18:04
  • Also: Is there a test that can be done to prove that the 64-bit output of `MurmurHash2_x86_64` collides more than a **true 64-bit hash**? Simply searching for collisions of `h1` to see if `h2` ever collides as well, seems to be ineffective. – bryc Apr 04 '18 at 19:08
  • @bryc: "would it then have the collision properties of a proper 64-bit hash?": Yes, if you do the mix properly. But it is black art, what "properly" means here. There are tests for hash functions. But, for example, if a hash passes SMHasher, it doesn't mean it's perfect. – geza Apr 05 '18 at 03:59
  • @bryc: I've given an example, where strength is clearly lower than 64-bit, so there's no need for a test to prove it. But anyway, if you want to play with these things, you may want to experiment with narrower hashes, like 8-bit, to make hash strength differences more pronounced with small computation efforts. – geza Apr 05 '18 at 04:03
  • @bryc: I believe MurmurHash2_x86_64 was designed this way so that it is fast on bare metal (the CPU can reorder instructions, so it can process the 2 streams in a more parallel fashion). If you aim for interpreted JavaScript, this may not matter too much. – geza Apr 05 '18 at 04:05
  • @bryc: note, I've edited my answer a little bit, because I'm not really sure what strength `MurmurHash2_x86_64` has, but it is certainly less than 64-bit. – geza Apr 05 '18 at 04:08
  • @bryc: FYI: other hashes use the same idea for performance reasons, for example, XXHash. It uses 4 parallel streams, then mixes the result. But I think it is hard to create a test which proves that it has a little bit less strength than 32-bit. – geza Apr 05 '18 at 04:12
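
For anyone who wants to try the narrow-hash experiment suggested in the comments, here is a minimal sketch of a collision-counting harness. The helper names (`narrow`, `countCollidingPairs`, `counterKeys`) are hypothetical, and 32-bit FNV-1a is used only as a stand-in; swap in the hash you actually want to examine. It truncates the hash to a few bits, counts colliding pairs over a fixed key set, and prints the count next to the ideal birthday estimate.

```javascript
// 32-bit FNV-1a over an array of bytes, used here only as a stand-in hash.
function fnv1a32(bytes) {
    let h = 2166136261;
    for (const b of bytes) {
        h ^= b;
        h = Math.imul(h, 16777619);
    }
    return h >>> 0;
}

// Keep only the top `bits` bits (1..32) of a 32-bit hash value.
function narrow(h, bits) {
    return h >>> (32 - bits);
}

// Count pairs of keys that share a hash value.
function countCollidingPairs(hashFn, keys) {
    const counts = new Map();
    let pairs = 0;
    for (const key of keys) {
        const h = hashFn(key);
        const c = counts.get(h) || 0;
        pairs += c; // the new key collides with every earlier key in this bucket
        counts.set(h, c + 1);
    }
    return pairs;
}

// Test keys: the counters 0..n-1 as 4 little-endian bytes.
function counterKeys(n) {
    const keys = [];
    for (let i = 0; i < n; i++) {
        keys.push([i & 255, (i >>> 8) & 255, (i >>> 16) & 255, i >>> 24]);
    }
    return keys;
}

const keys = counterKeys(1000);
for (const bits of [8, 12, 16]) {
    const observed = countCollidingPairs(k => narrow(fnv1a32(k), bits), keys);
    const ideal = (1000 * 999 / 2) / Math.pow(2, bits);
    console.log(bits, observed, ideal.toFixed(1));
}
```

An observed count far above the ideal estimate hints at a weakness for that key set. A count close to it proves nothing, since structured inputs (for example, keys that differ in only one stream) can still expose the problem.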