
I am looking to generate unique 16-byte / 128-bit hashed IDs (GUIDs) that do not need to be cryptographically secure. For example, imagine that the ID is the 128-bit hash MD5("some user generated strings").

I would have preferred to use a SHA algorithm, but SHA doesn't come in a 128-bit variant AFAIK. The older MD5 generates a 128-bit hash which is exactly what I need.

But since the SHA algorithm is presumably newer / better than the MD5 algorithm, what would yield the best result:

  1. Using MD5?
  2. Or using SHA-256 and XOR'ing the two 16-byte halves together to get a 128-bit hash?
  3. Or simply using the first 128 bits of SHA-1 or SHA-256 (this is answered in other Stack Overflow questions, e.g. "Using N first bits of a hash function to have an N-bit hash")?

Would option 3, for example, be better than option 2? Or are they equally good?
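For concreteness, here is a minimal sketch of the three options using Python's standard `hashlib` module. The function names `md5_128`, `sha256_xor_128`, and `sha256_trunc_128` are made up for illustration, and the input is assumed to be UTF-8 encoded text.

```python
import hashlib


def md5_128(data: bytes) -> bytes:
    """Option 1: plain MD5, which is natively 128 bits."""
    return hashlib.md5(data).digest()


def sha256_xor_128(data: bytes) -> bytes:
    """Option 2: XOR the two 16-byte halves of the SHA-256 digest."""
    d = hashlib.sha256(data).digest()
    return bytes(a ^ b for a, b in zip(d[:16], d[16:]))


def sha256_trunc_128(data: bytes) -> bytes:
    """Option 3: keep only the first 128 bits of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:16]


sample = "some user generated strings".encode("utf-8")
for fn in (md5_128, sha256_xor_128, sha256_trunc_128):
    # Each option yields exactly 16 bytes, printed here as 32 hex characters.
    print(fn.__name__, fn(sample).hex())
```

All three return 16 bytes, so they are interchangeable as far as storage goes; the question is only about which one distributes better and runs faster.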

I have no clue about the inner workings of SHA, so my question might be totally off, please help enlighten me. Thanks!

  • If they do not need to be cryptographically secure, what do you mean by 'better' or 'good'? – g_uint Sep 01 '22 at 07:29
  • I suppose a combination of speed and collision probability. AFAIU SHA might compute faster than MD5 on modern CPUs. And I think SHA has a slightly lower collision risk. But I'm unsure what happens if I choose solution 2 or 3 over using MD5. – Michael Seifert Sep 01 '22 at 07:34
  • Unfortunately can't say anything about either, but the speed requirement you can benchmark yourself quite easily. Just do both implementations and time them. But if the application is not latency sensitive I would not worry about speed initially. – g_uint Sep 02 '22 at 08:49
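A rough timing comparison along the lines suggested in the comment above could look like the following sketch. The 1 KiB payload and the run count are arbitrary assumptions and `bench` is a hypothetical helper; results depend heavily on the CPU, the Python build, and the real input sizes.

```python
import hashlib
import timeit

payload = b"x" * 1024  # assumed 1 KiB input; substitute realistic data


def bench(name: str, runs: int = 2000) -> float:
    """Return the total seconds needed to hash `payload` `runs` times."""
    return timeit.timeit(lambda: hashlib.new(name, payload).digest(), number=runs)


results = {name: bench(name) for name in ("md5", "sha1", "sha256")}
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s for 2000 hashes")
```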

1 Answer

The easiest way to get a 16 byte ID is to generate 16 random bytes. Hashing can only decrease the quality of the ID, but maybe you are mixing it up with encoding?

Encoding can format the 16 random bytes as a readable string. If you need to encode the ID to store it as a string in a database, you are better off using Base64 for a compact format, or hex encoding to get something similar to a GUID (this is what most MD5 functions apply after hashing to produce a readable string).
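As a small sketch of the difference between the raw bytes and their encodings, using only the Python standard library (the variable names are illustrative, and the `uuid.UUID` call is used here purely to get the GUID-style layout, not to produce a standards-conformant UUID version):

```python
import base64
import secrets
import uuid

raw = secrets.token_bytes(16)  # 16 random bytes = 128 bits

# Hex encoding: 2 characters per byte, 32 characters in total.
hex_id = raw.hex()

# URL-safe Base64 with the "==" padding stripped: 22 characters.
b64_id = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

# GUID-style 8-4-4-4-12 layout of the same 16 bytes: 36 characters.
guid_like = str(uuid.UUID(bytes=raw))

print(hex_id, b64_id, guid_like, sep="\n")
```

The same encodings apply unchanged to a 16-byte hash digest instead of random bytes.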

martinstoeckli
  • The problem isn't generating a random ID. The problem is a hash problem. I need to hash arbitrary values into a 128 bit hash. Sorry if that wasn't entirely clear. I'll see if I can sharpen my question – Michael Seifert Sep 01 '22 at 07:22
  • @MichaelSeifert - So you are looking to generate some kind of fingerprint, to get an ID that depends on the content? Should the ID change if the content changes? Is it used to quickly find, for example, a row in a database again? – martinstoeckli Sep 01 '22 at 07:34
  • Yes, that's a great description of the problem :-) – Michael Seifert Sep 01 '22 at 07:51
  • @MichaelSeifert - Then nothing speaks against using MD5; you need a 128-bit output after all. For fingerprinting, MD5 is still a usable hashing algorithm: an accidental collision between ordinary inputs is extremely unlikely, and collisions are only realistic when someone deliberately constructs the inputs. Speed is an advantage in your case and MD5 is extremely fast (that speed is exactly why MD5 has a bad reputation for password hashing). And security is not a problem here. If you are still concerned about collisions (which is not necessary), you would have to enlarge the output to more than 128 bits. – martinstoeckli Sep 01 '22 at 08:33
  • If it is a problem that somebody with evil intentions can forge content that collides with other content (same hash), then this would speak against MD5. It takes deliberate effort, but if one can choose the content freely to the last byte, it can be done. – martinstoeckli Sep 01 '22 at 08:40
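To put a number on the accidental-collision risk discussed in these comments, the usual birthday approximation can be sketched as follows. It assumes the hash output behaves like uniformly random bits and that nobody is crafting inputs on purpose; `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_items: int, bits: int = 128) -> float:
    """Approximate chance of at least one accidental collision among
    n_items uniformly distributed `bits`-bit hashes, using the birthday
    approximation 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)


for n in (10**6, 10**9, 10**12):
    print(f"{n:.0e} items, 128 bits: p ~ {collision_probability(n):.3e}")
```

Under these assumptions the output width dominates the result, which is why widening the hash is the only real lever if the estimate looks too high for a given number of items.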