
I want to create a hash or checksum for each of millions of URLs, such that identical URLs (after sanitizing) have the same hash/checksum.

If I generate SHA-1 (20 bytes) or SHA-256 (32 bytes) hashes of the URLs and store them as 64-bit integers (8 bytes) by XORing each 8-byte chunk of the hash (C# code example here), is it still safe from collisions? I've read some people say it should be fine, but I haven't found any credible source.

As I understand it, the XOR of [1, 5] and [5, 1] is the same even though the sequences are different, so the XOR-folding technique might introduce collisions. In that case, would any of the non-cryptographic hash algorithms like MurmurHash, FNV or xxHash be better for my use case, which requires the least chance of collisions at decent performance (not necessarily the fastest)?
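For illustration, the XOR-folding I have in mind looks roughly like this (a simplified sketch rather than my actual code; the class and method names are just placeholders):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class UrlHasher
{
    // Fold a SHA-256 digest (32 bytes) down to 64 bits by XORing its
    // four 8-byte chunks together. Note that a SHA-1 digest (20 bytes)
    // would not split evenly into 8-byte chunks.
    public static ulong XorFoldedSha256(string sanitizedUrl)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(sanitizedUrl));

        ulong folded = 0;
        for (int offset = 0; offset < digest.Length; offset += 8)
        {
            folded ^= BitConverter.ToUInt64(digest, offset);
        }
        return folded;
    }
}
```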

Nick
  • I suspect this might be off-topic here, but over at [security.se] there is a look at the [cryptological issues of hash truncation](https://security.stackexchange.com/q/72673/220320) that might help you. Note that truncation rather than XORing is discussed, but I have no idea as to which might be better. – Ken Y-N Nov 25 '20 at 00:49
  • Does this answer your question? [Probability of collision with truncated SHA-256 hash](https://stackoverflow.com/questions/19962424/probability-of-collision-with-truncated-sha-256-hash) – Ken Y-N Nov 25 '20 at 00:51
  • Thanks for those links. I'm specifically looking for guidance on XORing hash bytes rather than truncating, although the two may seem like similar issues. – Nick Nov 25 '20 at 00:56
  • Regardless of the algorithm, if the result is 8 bytes then you have created a 64-bit hash, and even if it is perfectly collision resistant, it still only takes about 2^32 operations to find a collision by brute force, which is practically nothing for security purposes. By "safe" do you mean "unlikely to happen by pure chance" or "unlikely for an attacker to be able to cause"? – Nate Eldredge Nov 25 '20 at 01:30
  • @NateEldredge Makes sense. By "safe" I mean "unlikely to happen by pure chance". The attack vector in my use case (hash of stored URLs) is very low due to no direct data access. It's purely to find identical URLs in a dataset. – Nick Nov 25 '20 at 01:34
  • If you want to truncate a cryptographically secure hash function down to n bytes, then just use the high- or low-order n bytes of the result. XORing parts of the output provides no benefit at all. – President James K. Polk Nov 25 '20 at 17:06
  • Nate's analysis is correct. Given your definition of safe, you could get away with that with very high statistical reliability if the number of URLs being hashed is orders of magnitude less than 2^32. I'd start getting nervous around 2^24 URLs - it'd probably be ok for most purposes but I wouldn't want a critical aeroplane system programmed that way. What's acceptable depends on whether you can afford a once-in-a-blue-moon failure. The XORing is generally marginally better than truncation with a good but not cryptographic strength hash, but analysis of truncation's close enough to use. – Tony Delroy Nov 25 '20 at 22:49

1 Answer


Short answer: no. 8 bytes (64 bits) of hash would not be very reliable for distinguishing "millions" of URLs.

See this table: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

The table indicates the probability of collision for a given number of elements. The judgement here depends on two things: how many elements are in your set, and your definition of "reliable".

For software that is intended to run repeatedly, you should probably assess:

  • The maximum number of elements (URLs) you expect to be present, particularly if you're progressively adding to the set;
  • The number of distinct datasets (or number of times) your software will be used with over its expected lifetime.
  • What the acceptable error rate is for your software, per period or over its lifetime.
  • How expensive it would be to fix an error -- is the hash in memory & you just need to re-run, or are you collecting these persistently in a DB/file and they are expensive to recollect?

There are a few calculators available to help with this. I've used this one: https://kevingal.com/apps/collision.html (switch it to 'bits' mode rather than buckets).
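If you'd rather compute the estimate yourself, the usual birthday-bound approximation p ≈ 1 - exp(-n² / 2^(bits+1)) is easy to code up. Here is a rough C# sketch (the names are mine, and the formula is only meant for the small-probability regime used in the assessments below):

```csharp
using System;

static class CollisionEstimator
{
    // Birthday-bound approximation: probability of at least one collision
    // among n values drawn uniformly at random from 2^bits possible hashes.
    public static double Probability(double n, int bits)
    {
        return 1.0 - Math.Exp(-(n * n) / Math.Pow(2.0, bits + 1));
    }

    static void Main()
    {
        Console.WriteLine(Probability(200e6, 74)); // ~1.1e-6
        Console.WriteLine(Probability(50e6, 81));  // ~5.2e-10
    }
}
```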

An example assessment:

  • 10 million URLs added per year
  • Software lifetime is 20 years & run 100 times a year, but it's always the same dataset
  • We accept only a one-in-a-million chance of error.
  • = 200 million URLs
  • => By calculator, a hash of 74 bits would give you ~1.1e-6 chance of collision -- but just go for 128 bits.

Another possible assessment:

  • 50 million URLs or elements in each dataset
  • Software lifetime is 20 years & run 100 times a year, each time with an independent dataset
  • We accept only a one-in-a-million lifetime chance of error.
  • = 50 million URLs per run
  • = The software needs to run 2000 times, with a lifetime error probability of one in a million
  • => Therefore probability of error on one run needs to be ~5e-10.
  • => By calculator, a hash of 81 bits would give you 5.2e-10 chance of collision -- just go for 128 bits.

Conclusion: just keep it simple & robust, use 128 bits of hash.

A few last words: XORing is neither necessary nor desirable; just truncate the hash to the number of bytes you are going to use.

If robust uniqueness is important, I suggest you stick with a cryptographic grade hash such as SHA-256 or SHA-1. For security purposes, SHA-256 is stronger (more resistant to cryptographic attacks) than SHA-1. Both these algorithms are still very fast.
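To make the truncation concrete, a minimal C# sketch might look like the following (the helper name and the use of Guid as a 128-bit container are just illustrative choices; any 16-byte representation works, since the value is only an opaque fingerprint):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class UrlFingerprint
{
    // Keep the first 16 bytes (128 bits) of SHA-256 -- no XOR folding needed.
    public static Guid Sha256Truncated128(string sanitizedUrl)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(sanitizedUrl));

        byte[] first16 = new byte[16];
        Array.Copy(digest, first16, 16);
        return new Guid(first16); // treated purely as an opaque 128-bit value
    }
}
```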

Thomas W