
I've been tasked with consolidating about 15 years of records from a laboratory, most of which is either student work or raw data. We're talking 100,000+ human-generated files.

My plan is to write a Python 2.7 script that will map the entire directory structure, create a checksum for each file, and then flag duplicates for deletion. I expect roughly 10-25% duplicates.

My understanding is that MD5 collisions are possible, theoretically, but so unlikely that this is essentially a safe procedure (let's say that if 1 collision happened, my job would be safe).

Is this a safe assumption? In case implementation matters, the only Python libraries I intend to use are the following (a rough sketch of the script follows the list):

  • hashlib for the checksums;
  • sqlite3 for databasing the results;
  • os for directory mapping.
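
For concreteness, here is a rough sketch of what I have in mind: chunked reads so large raw-data files don't have to fit in memory, and flagging (not deleting) anything that shares a hash. The root directory and database path are placeholders:

    import hashlib
    import os
    import sqlite3

    ROOT = "/data/lab_records"  # placeholder, not the real root
    DB_PATH = "checksums.db"    # placeholder output database

    def md5_of_file(path, chunk_size=1024 * 1024):
        # Hash in 1 MB chunks so multi-GB files don't exhaust memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                h.update(chunk)
                chunk = f.read(chunk_size)
        return h.hexdigest()

    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, md5 TEXT)")

    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                digest = md5_of_file(full_path)
            except (IOError, OSError):
                continue  # unreadable file: skip it rather than abort the scan
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                         (full_path, digest))
    conn.commit()

    # Flag duplicates: any hash that maps to more than one path.
    for digest, count in conn.execute(
            "SELECT md5, COUNT(*) FROM files GROUP BY md5 HAVING COUNT(*) > 1"):
        print("%s appears %d times" % (digest, count))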
Jonline
  • Yeah, I'm going to infer from the existence of that question and the variety of answers that it must, indeed, be safe to do this. I'll second closing this, thanks. – Jonline Jun 03 '14 at 18:56
    You may also find this discussion of MD5 collisions relevant: http://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value . – user3499545 Jun 03 '14 at 18:57
    If you want to be extra paranoid, you can use a better hash, such as SHA-512 (also available in hashlib). Collisions are even more astronomically unlikely, and it's infeasible to produce a collision even if one wanted to (that's not true of MD5). –  Jun 03 '14 at 18:57
    Out of curiosity, why would you use MD5 rather than SHAx? – Spencer Ruport Jun 03 '14 at 18:57
    Yes, a 128-bit hash space is large enough that a collision with fewer than a million files is astronomically unlikely. If you're worried, you can move to a larger hash space (SHA-512 has been mentioned), and go back and test the probable duplicates for actual equality. – Sneftel Jun 03 '14 at 18:59
  • @SpencerRuport I'd be lying if I said I had a reason beyond habit; I wrote this little script a year ago to do this with my music files (of which there were dramatically fewer and none of which were truly irreplaceable) and was hoping to reuse it, basically. If SHAx doesn't dramatically ramp up the processing time, I may indeed take the safer route, even if it is overkill. – Jonline Jun 04 '14 at 21:11

2 Answers


The probability of two files accidentally sharing the same MD5 hash is 1 in 2^128, i.e.:

0.000000000000000000000000000000000000002938735877055718769921841343055614194546663891

For comparison, the probability of being hit by a 15 km asteroid is 0.00000002. I'd say yes.
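
That figure is for one pair of files. As a comment below notes, roughly 100,000 files form almost 5 billion distinct pairs, so a quick back-of-the-envelope check (plain Python; the only assumption is the file count from the question) is:

    # Birthday-bound estimate: per-pair probability times the number of pairs.
    n = 100000
    pairs = n * (n - 1) // 2               # ~5 * 10**9 pairs
    p_any_collision = pairs / float(2 ** 128)
    print(p_any_collision)                 # ~1.5e-29, still astronomically small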

Backing up the files and testing the script thoroughly remain good advice; human mistakes and bugs are far more likely to happen.

Community
  • But there are almost 5 billion ways to choose pairs of 100,000 items. – David Ehrmann Jun 03 '14 at 20:59
  • Best answer to a Stack question I've ever received, haha. Don't suppose you have a comment on how much of a file is necessary to use as an identifying signature? My instincts say "all of it if you want to be sure", but reason tells me that hashing every byte on several hard drives is pretty intensive. – Jonline Jun 04 '14 at 20:47
  • @Jonline I believe you want to hash *part* of the file? This is dangerous, since many binary formats of the same type share a common header; it's very possible to have two different files with similar headers and thus the same partial signature. –  Jun 04 '14 at 21:14
  • I'd seen comments elsewhere on StackOverflow, while researching this, suggesting (in the context of media files like video) that hashing a few KB of a file would be enough to reach a reasonable degree of certainty. Was just curious if you had an opinion on the matter! – Jonline Jun 04 '14 at 21:27
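
A minimal sketch of the safer version of the partial-hash idea discussed in these comments: use file size plus a hash of a small prefix only as a cheap prefilter, then confirm surviving candidates with a full-file hash. The 64 KB prefix size and the function name are illustrative choices, not from the thread:

    import hashlib
    import os

    def quick_signature(path, prefix_size=64 * 1024):
        # Cheap prefilter: file size plus an MD5 of the first 64 KB.
        # NOT proof of equality: files of the same format often share headers.
        h = hashlib.md5()
        with open(path, "rb") as f:
            h.update(f.read(prefix_size))
        return (os.path.getsize(path), h.hexdigest())

    # Only files whose quick signatures collide need the expensive full-file
    # hash; everything else already differs in size or in its first 64 KB.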

The recent research on MD5 collisions may have worried you: in 2013, algorithms were published that generate MD5 collisions in about a second on an ordinary computer. However, I assure you that this does not nullify the use of MD5 for checking file integrity and finding duplicates. It is highly unlikely that you'll get two normal-use files with the same hash, unless of course you're deliberately looking for trouble and plant binary files engineered to collide. If you're still paranoid, then I advise you to use a hash function with a larger output space, such as SHA-512.
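
Switching is a one-line change, since hashlib exposes the same streaming interface for every algorithm; and as extra insurance you can byte-compare flagged pairs before deleting anything. A minimal sketch (the helper names are illustrative):

    import filecmp
    import hashlib

    def sha512_of_file(path, chunk_size=1024 * 1024):
        # Same structure as an MD5 version: only the constructor changes.
        h = hashlib.sha512()
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                h.update(chunk)
                chunk = f.read(chunk_size)
        return h.hexdigest()

    def definitely_identical(path_a, path_b):
        # Final check before deletion: byte-for-byte content comparison.
        # shallow=False makes filecmp compare contents, not just stat metadata.
        return filecmp.cmp(path_a, path_b, shallow=False)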

Abdul Fatir