
Apologies if this is a duplicate question; most of those I've found are over my head, so I may have missed the answer.

For a given hash, say MD5 (128 bits), what is the chance of a hash collision with 10^12 of them?

My maths is not great, I've come up with this equation (I think it's correct) but have no idea how to solve it:

Collision_Chance = 1 - (1 - (1 / 2^128) ) ^ (10^12)
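
As a rough check, the expression above can be evaluated numerically. The sketch below is only illustrative, and the function name is made up for this example. A direct floating-point evaluation rounds `1 - 1/2^128` to exactly 1 and returns 0, so it uses `log1p` and `expm1` instead. Note that this expression gives the chance that at least one of the 10^12 hashes equals one fixed value, which is not the same as the chance that any two of them collide.

```python
import math

def specific_match_probability(n: int, bits: int) -> float:
    """Evaluate 1 - (1 - 1/2^bits)^n without losing precision.

    Hypothetical helper for illustration only; assumes hash values
    are independent and uniformly distributed.
    """
    p_single = 2.0 ** -bits
    # (1 - p)^n == exp(n * log(1 - p)); log1p and expm1 keep tiny values accurate.
    return -math.expm1(n * math.log1p(-p_single))

result = specific_match_probability(10**12, 128)
print(result)  # on the order of 1e-27
```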

I'm guessing it's somewhere around 10^-26, does this sound about right?

Thanks

Edit: I think my estimate is very wrong. See Birthday Paradox

Jodes
  • "birthday paradox" is the google term. And the chance is about SQRT(n) , in your case 128 bits -->> 64 bits. And 10^12 is less than 64 bits. – wildplasser Jan 11 '14 at 13:36
  • @wildplasser “The chance is about SQRT(n)” – what does that mean? A chance should be a number between 0 and 1. sqrt(n) is the number of values such that the probability for a collision is 1/2. – Christopher Creutzig Jan 11 '14 at 14:02
  • Sloppy wording. If you have a keyspace of 128 bits, and you sample randomly from that, the chance of a collision (in the cumulated sample) passes the .5 limit when you have about 2^64 items in your set. Your wikipedia-link has the correct wording (and formulas) – wildplasser Jan 11 '14 at 14:07

2 Answers


What does your formula say for having 2^128 + 1 values? I believe it does not say that the collision probability is 1, so it cannot be right. Actually, I know it is not: the correct formula is rather large and unwieldy, but there are good approximations using the exponential of a fraction. SO does not typeset formulas, so I won’t try to write them down here.

The best key word to search for is probably “birthday attack”.
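
For illustration, one commonly used approximation of this kind is p ≈ 1 - exp(-n(n-1) / (2N)), where n is the number of hashes and N = 2^128 is the number of possible values. The sketch below assumes MD5 outputs behave like independent, uniformly random 128-bit values; the function name is hypothetical and this is not necessarily the exact formula meant above.

```python
import math

def birthday_collision_probability(n: int, bits: int) -> float:
    """Approximate chance that at least two of n random hashes collide.

    Illustrative sketch: uses p ~ 1 - exp(-n(n-1) / (2N)) with N = 2^bits,
    assuming uniformly distributed, independent hash values.
    """
    space = 2 ** bits
    if n > space:
        # Pigeonhole principle: more values than slots forces a collision.
        return 1.0
    exponent = n * (n - 1) / (2 * space)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)

p = birthday_collision_probability(10**12, 128)
print(p)  # on the order of 1e-15
```

Under these assumptions, 10^12 MD5 hashes give a collision chance of roughly 1.5 × 10^-15, and 2^128 + 1 values give exactly 1.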

Christopher Creutzig
  • `What does your formula say for having 2^128 + 1 values?` I don't think it's saying that?! If someone can explain how to come to the correct probability of the above example, approximation or otherwise, that would be great. – Jodes Jan 11 '14 at 13:32
  • What I meant is: Assume you have 2^128 + 1 hash values. What does your formula say the collision probability is? (It should be 1.) (And my answer contains a link pointing to a correct approximation formula.) – Christopher Creutzig Jan 11 '14 at 13:59
  • Thanks. The approximation formula was exactly what I wanted. – Jodes Jan 11 '14 at 14:21

Why would a hash collision be a problem? Hashes are never designed to generate unique values, only to facilitate a fast first comparison.

If you are having trouble with hash collisions, you're using it wrong.

oɔɯǝɹ
  • -1 This is simply wrong. Many hash functions *are* designed to make collisions vanishingly unlikely, and hash functions with these properties have many great applications. See http://en.wikipedia.org/wiki/Cryptographic_hash_function for definitions and examples. –  Jan 11 '14 at 13:26
  • @delnan `unlikely` is the key word here. A hash function is always either a literal copy of the item (and thus unique) or an operation that ultimately simplifies the data (and therefore a kind of lossy compression). There is no guarantee, and no intention, to make generated hashes unique. – oɔɯǝɹ Jan 11 '14 at 13:29
  • You appear to be invoking the pigeonhole principle. That principle is true, but it does not mean one must never design a system that would falter in the face of hash collisions (which you're apparently saying). That would be at odds with modern cryptography practice and several other fields, which rely on hash collisions (for certain well-chosen hash functions) being less likely than, for example, a cosmic ray flipping the result of the comparison. In fact, these systems often don't have, or logically can't have, the full original value to compare. –  Jan 11 '14 at 13:34
  • How can an algorithm be correct when it is known to fail on certain inputs (especially when you are not in control of the inputs?) – oɔɯǝɹ Jan 11 '14 at 13:37
  • You can duke this out with the cryptographers of the world, but I'd say for now I have presented sufficient evidence to support that your answer is wrong ;-) –  Jan 11 '14 at 13:39
  • What part of my answer is 'wrong'? I'm only trying to suggest that the original question might imply an incorrect understanding or application of hash codes. I didn't bring cryptography and practical trade-offs into the discussion... – oɔɯǝɹ Jan 11 '14 at 13:43
  • My point is, your claim that hash collisions being a potential problem means "you're using it wrong" (which, let's be honest here, is pretty much the whole answer) is not sustainable. It contradicts the extremely wide-spread use of hash functions for purposes that would be broken by collisions, yet work just fine despite the theoretic possibility of collisions. –  Jan 11 '14 at 13:46
  • Ah, ok. That's a whole different kind of reasoning you're using now. I can see your point. I'm only trying to say that when using hash codes you need to deal with the fact that they can and will collide in practice, however unlikely that may be in theory. Why else would a hash table use buckets? There are enough examples of wrong usage of hash codes that I want to warn against... – oɔɯǝɹ Jan 11 '14 at 13:51
  • Because everything that makes collisions unlikely enough to not bother cannot be applied to hash tables: It's incompatible with the space and performance constraints of such data structures. The hash functions are weaker (usually not cryptographically strong), the domain of the hash function is dozens of orders of magnitude smaller, and so on. In hash tables, collisions are likely (virtually guaranteed) and must be handled -- theory and practice agree here, too. You're right that not every use of hash functions can ignore collisions, but a blanket statement is just misleading. –  Jan 11 '14 at 13:57
  • We used to think that SHA1 was cryptographically strong as well, but see where that got us ... :-) – oɔɯǝɹ Jan 11 '14 at 14:00