
I'm using 32-bit FNV-1a hashing, but now I want to reserve one of the bits to hold useful information about the input key. That is, I want to use only 31 of the 32 bits for the hash and 1 bit for something else.

Assuming FNV is well distributed for my application, is it safe to assume that dropping 1 bit will increase the collision rate by a factor of 32/31, as opposed to something dramatic?

The algorithm recommends XORing the discarded MSB with the LSB, but for 1 bit that seems pointless. As such, would it matter which bit is discarded (MSB or LSB)? And if not, would it matter whether the bit were discarded after hashing each byte (i.e. using an even-numbered "prime") or only once, after 32-bit hashing the entire byte array?

codechimp
  • Whatever else you do, definitely *don't* use [an even multiplier](https://ideone.com/CFRYrH). That progressively discards the influence of the prefix of the string. – harold Jan 15 '22 at 13:48
  • @harold, please explain. The FNV-1a op is `((int)sum ^ (char)byte) * (int)prime`. I'm thinking of discarding the MSB with `(((int)sum ^ (char)byte) * (int)prime) << 1`, i.e. using `2*prime` – codechimp Jan 15 '22 at 15:32
  • @harold, can I understand you to mean that the MSB should be discarded only at the end of hashing the entire byte array? If so, is the advice the same when discarding the LSB (i.e. ` ... >>1 `)? – codechimp Jan 15 '22 at 15:47
  • With an even multiplier, after the first iteration, the top 31 bits depend on the first byte of the input. No problem. After the second iteration, only the top 30 bits depend on the first byte (and the top 31 depend on the second byte). 30 more iterations and then *none* of the bits depend on the first byte anymore, and just 1 depends on the second byte, and so on. It's OK to discard a bit *after* computing the hash (and I'd probably choose the LSB, because it is the least "mixed") – harold Jan 15 '22 at 15:48
  • The collision rate is increased by a factor of (2\*\*32) / (2\*\*31), aka 2. – President James K. Polk Jan 15 '22 at 18:56
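The effect described in the comments can be checked with a small sketch. It assumes the standard 32-bit FNV-1a parameters (offset basis `0x811C9DC5`, prime `0x01000193`); the function name `fnv1a_32`, its `multiplier` parameter, and the sample inputs are made up for illustration. Two 32-byte inputs that differ only in their first byte should hash differently with the normal odd prime, but identically once the multiplier is doubled, because each multiplication by an even number shifts one more bit of the prefix's influence out of the 32-bit state.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime (odd)
MASK32 = 0xFFFFFFFF


def fnv1a_32(data: bytes, multiplier: int = FNV_PRIME) -> int:
    """32-bit FNV-1a; `multiplier` is a hypothetical knob for this experiment."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        # XOR the byte in first, then multiply, keeping only the low 32 bits.
        h = ((h ^ byte) * multiplier) & MASK32
    return h


# Two 32-byte inputs that differ only in the first byte.
suffix = bytes(range(31))
key_a = b"A" + suffix
key_b = b"B" + suffix

standard_a = fnv1a_32(key_a)
standard_b = fnv1a_32(key_b)

# Even multiplier (2 * prime): the first byte's influence is gone after 32 steps.
even_a = fnv1a_32(key_a, 2 * FNV_PRIME)
even_b = fnv1a_32(key_b, 2 * FNV_PRIME)

# Dropping one bit only once, after the full 32-bit hash has been computed.
drop_lsb = standard_a >> 1          # keep the upper 31 bits
drop_msb = standard_a & 0x7FFFFFFF  # keep the lower 31 bits

print(f"standard: {standard_a:08x} vs {standard_b:08x}")
print(f"even    : {even_a:08x} vs {even_b:08x}")
print(f"31-bit  : drop LSB {drop_lsb:08x}, drop MSB {drop_msb:08x}")
```

Either `drop_lsb` or `drop_msb` leaves the top bit of the 32-bit word free for the extra flag; which of the two keeps the better-mixed bits is the judgment call discussed in the comments.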

1 Answer


Removing a single bit from a 32-bit hash code will have a larger effect than a 32/31 increase in the collision rate. To see why, note that there are 2^32 possible 32-bit hashes and 2^31 possible 31-bit hashes, meaning that removing a bit from the hash cuts the number of possible hashes down by a factor of two - a pretty significant reduction. This brings about roughly a doubling of the probability that you see a hash collision across your hashes.

If you have a sufficiently small number of hashes that collisions are rare, then cutting out a single bit is unlikely to change much. But if collisions were already an issue, dropping a bit will roughly double the chance you see them.
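As a rough numeric illustration, the sketch below uses the usual birthday approximation and assumes the hash behaves like a uniform random function, which real FNV output only approximates. The function names and the key counts are made up for the example. The expected number of colliding pairs doubles exactly when a bit is dropped, while the probability of seeing at least one collision roughly doubles only while collisions are still rare.

```python
import math


def expected_colliding_pairs(num_keys: int, hash_bits: int) -> float:
    """Expected number of key pairs sharing a hash, assuming uniform hashing."""
    return num_keys * (num_keys - 1) / (2 * 2 ** hash_bits)


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Birthday approximation of P(at least one collision): 1 - exp(-pairs)."""
    return -math.expm1(-expected_colliding_pairs(num_keys, hash_bits))


for n in (1_000, 10_000, 100_000):
    p32 = collision_probability(n, 32)
    p31 = collision_probability(n, 31)
    print(f"n={n:>7}: P32={p32:.6f}  P31={p31:.6f}  ratio={p31 / p32:.3f}")
```

With a large number of keys the probability ratio falls below 2 simply because a probability cannot exceed 1; the expected count of colliding pairs is still twice as large.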

templatetypedef
  • Yeah. I don't know why I thought the collision rate would increase only by 32/31 rather than double. Brain fart, I guess. – codechimp Jan 17 '22 at 15:07