I have 16 bytes to hold a string hash. I understand that collisions are a fact of life when you reduce strings of arbitrary length to a fixed-length sequence of bytes, but I'd like to avoid them as much as possible. Am I better off using a deprecated algorithm like MD5 that has an output of 16 bytes, or the first 16 bytes of a yet-to-be-broken algorithm like SHA-256?
-
Next time, ask it on [crypto.se] – Artjom B. May 25 '17 at 08:28
1 Answer
Given that NIST defines SHA-224 as a truncated SHA-256 (computed with different initial values), that's as official a 'seal of approval' as you're ever going to get on the question "is it a good idea to truncate SHA-256 to fit size requirements?".
And since MD5 is utterly demolished and soon to join MD4 on the "don't use even for internal testing" shelf, the answer is pretty clear - go with a truncated SHA-256.
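For illustration, here is a minimal sketch of that truncation in Python. The function name `sha256_128` and the choice of UTF-8 as the string encoding are my own assumptions, not something the question specifies; any fixed encoding works as long as you use it consistently.

```python
import hashlib


def sha256_128(text: str) -> bytes:
    """Return the first 16 bytes (128 bits) of the SHA-256 digest of `text`."""
    # Assumption: strings are encoded as UTF-8 before hashing.
    full_digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    return full_digest[:16]  # keep only the leading 16 bytes


print(sha256_128("hello world").hex())
```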
That being said, the moment you truncate it, the number of collisions will naturally increase. SHA-256 output is statistically well spread, so shortening it shouldn't produce more collisions than what you inevitably get with only 128 bits (well, a bit more, as no hash is perfect). Truncation even comes with a bonus: full SHA-256 is vulnerable to length extension attacks, while a truncated digest hides part of the internal state and so resists them.
I know a lot of systems in the industry using halved SHA-512 instead of SHA-256 for increased resistance to length extension attacks (well, theoretical at this time) - an additional bonus is the performance boost on 64-bit systems, where SHA-512 is typically faster to compute than SHA-256.
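To put a rough number on "what you get inevitably with only 128 bits", you can use the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)), for n items and a b-bit hash. The sketch below assumes an ideal, uniformly distributed hash, which a truncated SHA-256 only approximates; the function name is hypothetical.

```python
import math


def collision_probability(n_items: int, bits: int = 128) -> float:
    """Approximate chance of at least one collision among n_items hashes.

    Birthday approximation for an ideal hash with `bits` bits of output.
    """
    exponent = -n_items * (n_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A billion distinct strings hashed to 128 bits.
print(collision_probability(10**9))
```

In other words, accidental collisions at 128 bits stay negligible until the number of hashed strings gets close to 2^64.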
The most common form of truncation I've encountered is XOR-ing the first half with the second half. I'm not sure if it provides any additional benefits but people feel more at ease when they see 'unrecognizable' output from a 'truncator' so they just go with it.
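A minimal sketch of that XOR folding, assuming a 32-byte SHA-256 digest folded down to 16 bytes (the function name is made up for the example):

```python
import hashlib


def sha256_xor_fold(data: bytes) -> bytes:
    """Fold a SHA-256 digest to 16 bytes by XOR-ing its two halves."""
    digest = hashlib.sha256(data).digest()      # 32 bytes
    first_half, second_half = digest[:16], digest[16:]
    # XOR the halves byte by byte.
    return bytes(a ^ b for a, b in zip(first_half, second_half))


print(sha256_xor_fold(b"hello world").hex())
```

The result is still 128 bits wide, so the collision bound is the same as for plain truncation.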
UPDATE
As per deceze's suggestion - when a hash is qualified as "don't use even for internal testing" it means that it does a bad job for what it was designed to do and it should be avoided at all costs for that particular application but not necessarily for other applications.
Both MD4 and MD5 can be used as solid hashing algorithms in non-cryptographic settings, and I've seen systems re-purposing MD4 specifically for that - it's very fast, has a solid spread and if you're not too picky with collisions (say you're building a backup program that needs to know which files changed since the last backup) it can go head-to-head with some of the non-cryptographic hashes designed for those specific purposes.
However, more often than not, it's better to use the right tool for the job. Non-cryptographic hashes are designed first and foremost for speed, but also for good spread and a low collision rate, and some of them outshine even cryptographic hashes in that profile, with the only downside being that they are more or less predictable.
If you need a non-cryptographic hash, instead of resorting to broken cryptographic hashes, I'd suggest you take a look at some of the overall better hashes for non-cryptographic purposes like FNV-1/FNV-1a, Murmur and even plain CRC32 (a bit on the slow side, but faster than most cryptographic hashes). There is a really great comparison on speed, spread and collisions on SE so be sure to check it out.
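To give an idea of how small these are, here is a sketch of 32-bit FNV-1a next to CRC32 from Python's standard `zlib` module. The constants are the published 32-bit FNV offset basis and prime; the function names are mine.

```python
import zlib

FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def crc32(data: bytes) -> int:
    """CRC32 as an unsigned 32-bit integer, via zlib."""
    return zlib.crc32(data) & 0xFFFFFFFF


print(hex(fnv1a_32(b"hello world")), hex(crc32(b"hello world")))
```

Note that both of these produce only 4 bytes. If you want to fill the full 16 bytes with a non-cryptographic hash, you'd need a 128-bit variant (FNV and MurmurHash3 both define one).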

-
What could be a possible issue with using MD5 for a purpose where there's basically no attack surface, e.g. as function to compute a key in a hash table which deals with collisions anyway? Is there anything apart from "let's just all forget about MD5 entirely"? – deceze May 25 '17 at 02:10
-
@deceze - there are non-cryptographic hashes with better spread/low collision rate profiles than MD5 for a fraction of computational cost like FNV-1, SDBM, Murmur2 or even good ol' CRC32. Of course, when we talk about 'not even for testing' we mean for using it in applications it was designed for - as a cryptographic hash - and it fails miserably in that regard. – zwer May 25 '17 at 02:26
-
That list of hashes for alternative uses would be a good addition to the answer itself. – deceze May 25 '17 at 02:28