7

I wonder if it is 'safe' to hash a bunch of MD5 hash values together to create a new hash or whether this will in any way increase the probability of collisions.

The background: I have a couple of files with dependencies. Each file has an associated hash value which is calculated based on it's content. Let's call this the 'single-file' hash value. In addition to this, the file should also have a hash value which includes all the dependent files, the 'multi-file' hash value.

So the question is: Can I just take all the single-file MD5 hash values of the dependent files, concatenate them and then calculate an MD5 over the concatenated values to get the multi-file hash value. Or will this result in an MD5 hash that is more likely to collide than if I would concatenate the content of all dependent files together.

Alternatively, could I xor the single-file hash values together to generate a multi-file hash value, or would this likely result in more collisions?

Janick Bernet
  • 20,544
  • 2
  • 29
  • 55

3 Answers3

3

Sounds like you need a Merkel Tree

James
  • 9,064
  • 3
  • 31
  • 49
1

MD5 has a lot of collision problems, see MD5 entry on Wikipedia.

However, if you use MD5 not for security but as a unique marker to check dependencies, even hashing contatenated hashes should be pretty safe.

Or, if it's not too late, switch to SHA-1.

squadette
  • 8,177
  • 4
  • 28
  • 39
  • From my understanding, the collision problems are mainly relevant under the assumption of an active attacker, who's sole purpose is to provoke a collision, but not that random collisions are considerably more likely than using SHA-1. Since it's solely to generate a unique id and not for security purposes, active attackers are a non-issue and performance is the main concern. – Janick Bernet Sep 18 '11 at 12:53
  • For checksums (and it's pretty obvious from the question that this is what OP is doing), forged collisions don't matter. Only the last part of the second sentence adresses the question at all, and it definitely needs elaboration - how did you arrive at this conculsion? –  Sep 18 '11 at 12:54
  • I believe that you should be pretty safe then. You're mapping files to points in 2^128 space. Then you're mapping another small file (contatenation) to another point in 2^128 space. If you trust MD5 to hash your files uniformly — you should trust it to hash your concatenations just as uniformly. – squadette Sep 18 '11 at 12:56
  • @squadette: It could be (especially given that its easy to artificially construct collisions) that an MD5 hash itself has some properties that makes them poor bases for hashing again, using MD5. For instance it could *theoretically* be that the infinite application of MD5 hashes converges to a specific hash value. – Janick Bernet Sep 18 '11 at 13:01
1

I think the risks of a collision is about the same for hashing the concatenated files, as to hashing the concatenated file hashes.

Gerben
  • 16,747
  • 6
  • 37
  • 56
  • I *think* so, too. :) However, I'd like to better be safe than sorry and end up with tons of collisions all the time :) – Janick Bernet Sep 18 '11 at 13:49
  • Hashing is all about chance. If you want to be 100% sure, you shouldn't use hashes. Just know that hashes are designed to be as randomly distributed as possible to chances of collisions are very near 1/hashsize (1/2^128 for md5). You won't get tons of collisions, but you code should not break sometimes a collision occurs. – Gerben Sep 18 '11 at 13:53
  • I'm not afraid of a 1:1^128 collision chance, but, as I wrote to squadette, theoretically applying MD5 recursively could converges to a specific hash value, who am I to know. Hence I'd like to get some info by people knowing the algorithm, if it has such problems or not. – Janick Bernet Sep 18 '11 at 14:05
  • First you are not hashing a hash, but hashing a string of concatenated hashes. Second you are only going 2 levels deep, not like rehashing a 1000 times. Converging will not occur after just 2 round. – Gerben Sep 18 '11 at 14:30
  • True, I'm only going 1 level deep in my example, maybe I should generalize the question more. But if it would converge, going one level deep would definitely increase the probability of collisions, it's just a question of by how much. As I said, I do not believe it to be an issue, but I would still like to get some mathematical facts, or at least some inside knowledge of people knowing the inner workings of MD5, that it's save to use it this way. – Janick Bernet Sep 18 '11 at 14:47