9

If counting from 1 to X, where X is the first number to have an md5 collision with a previous number, what number is X?

I want to know if I'm using md5 for serial numbers, how many units I can expect to be able to enumerate before I get a collision.

John Lewis
  • 712
  • 7
  • 15
  • Why would you still want to use MD5 these days if there are better alternatives (that are not broken yet)? – Dirk Vollmar Jul 30 '11 at 20:07
  • 3
    If you're concerned about collisions, go with one of the SHA functions. In fact, even if you're not concerned about collisions, go with SHA. They're not worse than MD5 in any appreciable manner, but they are better in several. – Slubb Jul 30 '11 at 20:09
  • Why do you want to hash the serial numbers? For what purpose are they used? – mgronber Jul 31 '11 at 07:31
  • The use case is that I have a set of products that need to be stamped with an ID code. The client does not want ID codes like "1", "2"; they wanted something more like "Serial# c4ca4238a0b923820dcc509a6f75849b", which looks official, if a little long. Knowing that collisions can happen, I wanted to ensure that one was unlikely during the lifespan of this product run, which may be millions, if not hundreds of millions (their hope). MD5 hashing was an easy out for me because it's right there in PHP and I don't have to come up with something foolproof. – John Lewis Aug 01 '11 at 19:30
  • 1
    It's a bad idea to use MD5 to generate a serial number. It's a hash function, so it can have collisions: nothing in its design guarantees that the hash is unique. I'm not an expert, but you should consider using an encryption function for your purpose. – AsTeR Jan 24 '12 at 12:58
  • So what's the answer? How did you end up generating IDs? – alexis Feb 20 '12 at 12:30
  • @JohnLewis it might also interest you to know that those other *official looking* serial numbers are usually in no way random at all. They're often a combination of a few things such as an integer id, a timestamp, a manufacturing location, etc. To hash a serial number just to make it look official seems to defeat the purpose. They are called *serial* numbers and not pseudo-random numbers, after all. Oh, and you never have collisions that way. – dcow Aug 19 '13 at 20:48

6 Answers

6

Theoretically, you can expect collisions for X around 2^64. For a hash function with an output of n bits, first collisions appear when you have accumulated about 2^(n/2) outputs (it does not matter how you choose the inputs; sequential integer values are nothing special in that respect).
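
As a rough sketch of that estimate in Python: this uses the standard birthday approximation p ≈ 1 - exp(-k(k-1) / 2^(n+1)), which assumes the hash behaves like a uniformly random function. The function name is only illustrative.

```python
import math


def collision_probability(k, bits=128):
    """Approximate chance of at least one collision among k hashed values.

    Birthday approximation for an idealised `bits`-bit hash. MD5's known
    weaknesses concern deliberately crafted inputs and are not modelled here.
    """
    exponent = k * (k - 1) / (2 * 2 ** bits)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for k in (10 ** 6, 10 ** 8, 2 ** 32, 2 ** 64):
        print(k, collision_probability(k))
```

With 128-bit outputs, k = 2^64 gives a chance of roughly 39% of at least one collision, while a run of a few hundred million values stays many orders of magnitude below that.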

Of course, MD5 has been shown not to be a good hash function. Also, the 2^(n/2) is only an average. So, why don't you try it? Take an MD5 implementation, hash your serial numbers, and see if you get a collision. A basic MD5 implementation should be able to hash a few million values per second, and, with a reasonable hard disk, you could accumulate a few billion outputs, sort them, and see if there is a collision.
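
A minimal in-memory sketch of that experiment in Python, assuming the serial numbers are hashed as decimal ASCII strings (so 1 hashes to c4ca4238a0b923820dcc509a6f75849b). The helper names and the `nbytes` truncation parameter are illustrative; truncation is there only so the check can be seen to fire on a deliberately shortened hash.

```python
import hashlib


def md5_prefix(serial, nbytes=16):
    """MD5 of the decimal string form of `serial`, cut to `nbytes` bytes."""
    return hashlib.md5(str(serial).encode("ascii")).digest()[:nbytes]


def find_collisions(count, nbytes=16):
    """Hash serials 1..count, sort by digest, and compare neighbours.

    After sorting, equal digests sit next to each other, so one linear
    pass finds every colliding adjacent pair.
    """
    pairs = sorted((md5_prefix(s, nbytes), s) for s in range(1, count + 1))
    return [(a[1], b[1]) for a, b in zip(pairs, pairs[1:]) if a[0] == b[0]]


if __name__ == "__main__":
    print(find_collisions(100_000))          # full 128-bit digests
    print(len(find_collisions(1_000, 1)))    # 1-byte digests must collide
```

For billions of values the list would not fit in memory; the same idea applies with the digests written to disk and sorted with an external merge sort.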

Thomas Pornin
  • 72,986
  • 14
  • 147
  • 189
  • I *could* do it. But let's say for the sake of argument that you are right and that X is 2^64. The program I write to do this would be checking against all previous hashes which is non-trivial disk thrashing. Maybe with a SSD. – John Lewis Aug 01 '11 at 11:23
  • 2
    @John: to search for collisions in a big set of values, the trick is to sort them in ascending order. Then collisions appear as two identical consecutive values. Merge sorting implies only a small number of linear passes; no thrashing involved. Still, 2^64 is quite a lot. – Thomas Pornin Aug 01 '11 at 12:18
  • And your number of 2^n/2 is statistically not accurate although intuitively seems right. – John Lewis Aug 01 '11 at 19:32
  • 4
    You don't need to store or check against 2^64 values to detect cycles, there's a clever algorithm with O(1) space requirements. See this post: http://stevekrenzel.com/articles/md5-cycles – alexis Feb 20 '12 at 12:31
  • @alexis The algorithm you linked is for detecting loops in md5 where the result is fed in as data for the next iteration. That doesn't work for this problem because there is no inherent loop, even at the point of the first collision. – Ted Bigham Jul 02 '15 at 17:32
  • @ThomasPornin: Why would you need to sort the hashes? You could put them in a set or a HashMap to detect collisions. – Eric Duminil Nov 14 '17 at 08:27
2

I can't answer your question, but what you are looking for is a UUID. UUID serial numbers can be unique for millions of products, but you might need to check a database to mitigate the tiny chance of a collision.
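
For illustration, a small Python sketch using the standard `uuid` module. It assumes a random (version 4) UUID is acceptable; `new_serial` is a hypothetical helper name. The hex form is 32 characters long, the same length as an MD5 hex digest.

```python
import uuid


def new_serial():
    """Return a random version-4 UUID as 32 lowercase hex characters."""
    return uuid.uuid4().hex


if __name__ == "__main__":
    print("Serial#", new_serial())
```

A version-4 UUID carries 122 random bits, so a uniqueness check against a database, as suggested above, is a safeguard rather than something expected to trigger.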

Azsgy
  • 3,139
  • 2
  • 29
  • 40
1

As far as I know, there are no known MD5 collisions among the 2^32 values of a 32-bit integer.

Spacefish
  • 96
  • 3
1

I don't believe anyone has tested this.

Besides, if you have a simple incrementing number, you don't need to hash it.

dynamic
  • 46,985
  • 55
  • 154
  • 231
  • 1
    I want a serial number / ID number that doesn't reveal the total amount of units for user viewing. – John Lewis Aug 01 '11 at 11:18
  • As Atsch mentioned, a [UUID](http://en.wikipedia.org/wiki/Universally_unique_identifier) would be perfect for that. It would also be future-proof if you need to scale beyond 32-bit integers for your IDs. Although you would need to shard long before reaching that amount, which would also be made easier by using UUIDs instead of auto-incrementing IDs. – Marco Roy Nov 05 '15 at 21:16
0

I realize this is an old question but I stumbled upon it, found a much better approach, and figured I'd share it.

You have an upper bound for your ordinal number N, so let's take advantage of that. Let's say N < 2^32 ≈ 4.3*10^9. Now each time you need a new identifier you just pick a random 32-bit number R and concatenate it with R xor N (zero-pad before concatenation). This yields a random-looking unique 64-bit identifier which you could denote with just 16 hexadecimal digits.

This approach prevents collisions completely because two identifiers that happen to have the same random component necessarily have distinct xor-ed components, since their ordinal numbers N differ.

Bonus feature: you can split such a 64-bit identifier into two 32-bit numbers and xor them with each other to recover the original ordinal number.
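
A short Python sketch of the scheme, assuming the identifier is written as 16 hex digits with the random half first. The function names are illustrative.

```python
import secrets

MASK32 = 0xFFFFFFFF


def make_id(n):
    """Build a 64-bit identifier from ordinal n: R followed by R xor n."""
    if not 0 <= n <= MASK32:
        raise ValueError("ordinal must fit in 32 bits")
    r = secrets.randbits(32)          # random 32-bit component R
    return f"{r:08x}{r ^ n:08x}"      # zero-padded halves, 16 hex digits


def recover_ordinal(identifier):
    """Split the identifier into two 32-bit halves and xor them."""
    r = int(identifier[:8], 16)
    x = int(identifier[8:], 16)
    return r ^ x


if __name__ == "__main__":
    serial = make_id(12345)
    print(serial, recover_ordinal(serial))
```

Note that the bonus feature works for anyone who knows the scheme, so the ordinal is obscured rather than hidden.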

OscarJ
  • 413
  • 2
  • 11
0

It really depends on the size of your input. A perfect hash function has collisions every (input_length / hash_length) hashes. If your input is small, collisions are fairly unlikely; so far only a single one-block collision has been found.

pezcode
  • 5,490
  • 2
  • 24
  • 37
  • The series is [1...∞], where infinity is the high end. There is clearly a first number in that series that collides with one before it. I just want to know which number that is and if I'm ever likely to get that high. – John Lewis Aug 01 '11 at 11:20