I need to use the UUID datatype (128-bit) to store my hashes. The goal is to quickly identify distinct records by comparing their hashes rather than millions of strings that are 1-10k characters long. So the goal here is not security (resistance to reversal), but I suspect it goes hand in hand with the "uniqueness", or low collision rate, that a particular hashing method offers.

I am not limited to any DB functionality here (although it's a pity that DBs are usually quite behind in offering better/stronger hashing algorithms, not to mention wider (>128-bit) datatypes for storing UIDs), so I can use any open source implementation.

My original thinking was to use MD5, but I read somewhere that it might not even use the full potential of its 128 bits, storing some control segments in them.

Then there's also a quite fast MurmurHash that I could use.

Lastly, I was thinking that maybe it would be better to take something solid, like SHA-3, and just take the last 32 hex characters of it?

(Where) could I get some advice on which of the aforementioned (and other possible) methods is best, i.e. has the lowest collision likelihood?

msciwoj
  • Just to clarify things, you want to hash some record's data in order to produce a unique fingerprint of the record, store the hash in UUID format and use that to compare records? Why not use SHA3 for this and use a TEXT field to store the result? Is there a specific reason why you have the requirement that your hash should be stored in a UUID data type? –  Dec 14 '22 at 09:51
  • @Spyros Yes, I have such a constraint. One reason is having a native, dedicated data type for this binary identifier. Also, it's simply more performant to count or look up unique 32-character strings than 100-200+ character ones. – msciwoj Dec 14 '22 at 15:29

1 Answer

If you care about collisions, you should use none of MD5, SHA-1, or MurmurHash. MurmurHash is non-cryptographic, which means collisions are expected, and MD5 and SHA-1 are broken.

Appropriate options are SHA-2 (e.g., SHA-256), SHA-3, BLAKE2, or BLAKE3. All of these are cryptographic hash functions, and all of them provide very good cryptographic security, including equally good collision resistance for a given output size. My recommendation is to use a 256-bit output, because that provides 128-bit collision resistance; a 128-bit output only provides 64-bit collision resistance, which is not very good.
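To put rough numbers on the 64-bit figure, here is a minimal sketch of the standard birthday-bound approximation, assuming the hash behaves like a uniformly random function. The function name `collision_probability` is made up for this illustration, and the estimate only covers accidental collisions, not ones an attacker constructs on purpose.

```python
import math


def collision_probability(n_records: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_records share a hash.

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes an ideal hash with uniformly distributed outputs.
    """
    exponent = -n_records * (n_records - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 2**64):
        print(f"{n:>22} records, 128-bit hash: p ~= {collision_probability(n, 128):.3e}")
```

With a 128-bit output the probability only becomes substantial once the number of hashed items approaches 2^64, which is what "64-bit collision resistance" refers to.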

If you're going with a 256-bit output, SHA-256 is fastest if your CPU accelerates it (some recent ARM and Intel CPUs and most recent AMD CPUs do), and otherwise BLAKE3 tends to be the fastest. BLAKE2b-256 is still very fast and provides slightly better security than BLAKE3. SHA-3-256 is also fine, but slower.

If you're going with a 128-bit output, you can use a truncated SHA-256, SHAKE128 (which is in the Keccak family along with the SHA-3 algorithms), BLAKE2b-128 (or BLAKE2s-128), or BLAKE3.
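As a minimal sketch of the 128-bit options, the following uses Python's standard `hashlib` and `uuid` modules to turn a record's text into a UUID-shaped value. The function names are hypothetical, and UTF-8 encoding of the record text is an assumption. Note that the result is just 128 hash bits in UUID form: it does not carry valid UUID version/variant bits, so it is worth checking that the target database's UUID type accepts arbitrary 128-bit values.

```python
import hashlib
import uuid


def fingerprint_blake2b(text: str) -> uuid.UUID:
    """Hash text with BLAKE2b configured for a 16-byte (128-bit) digest."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
    return uuid.UUID(bytes=digest)


def fingerprint_sha256(text: str) -> uuid.UUID:
    """Hash text with SHA-256 and keep only the first 16 bytes (128 bits)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()[:16]
    return uuid.UUID(bytes=digest)


if __name__ == "__main__":
    record = "some long record text " * 100
    print(fingerprint_blake2b(record))
    print(fingerprint_sha256(record))
```

For a good hash function any 128 bits of the output are equally usable, so truncating to the first 16 bytes is no different from taking the last 32 hex characters.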

bk2204
  • OP seems not to be worried about security, just uniqueness, in which case even MD5 is fine. – forest Dec 16 '22 at 01:01
  • I don't agree that MD5 is fine for uniqueness if the hash is supposed to be over a set of data (as opposed to CSPRNG output). MD5 is known to have collisions, and therefore it's trivial to find two messages which hash to the same output. – bk2204 Dec 16 '22 at 08:05
  • It's possible to find two messages which hash to the same output, but that only matters if security is relevant. The chance of random files that were not designed maliciously colliding is extremely low. – forest Dec 16 '22 at 22:16
  • BLAKE2b-128 or simply truncated SHA-256 looks like a winner then. While security is not a concern, I would like to avoid possible collisions. I might be wrong, but my worry is that the security concerns are directly related to higher collision rates…? – msciwoj Dec 18 '22 at 09:01
  • A secure cryptographic hash function is going to give you the best possible collision resistance available, so yes, picking a non-cryptographic or broken function will give you a higher probability of collisions, or, if the data is attacker controlled, allow the attacker to create collisions. BLAKE2b-128 or truncated SHA-256 are both fine choices. – bk2204 Dec 18 '22 at 17:35