
I've been tasked with consolidating about 15 years of records from a laboratory, most of which is either student work or raw data. We're talking 100,000+ human-generated files.

My plan is to write a Python 2.7 script that will map the entire directory structure, create a checksum for each file, and then flag duplicates for deletion. I expect roughly 10-25% duplicates.

My understanding is that MD5 collisions are possible, theoretically, but so unlikely that this is essentially a safe procedure (let's say that if 1 collision happened, my job would be safe).

Is this a safe assumption? In case implementation matters, the only Python libraries I intend to use are the following (a rough sketch of the script follows the list):

  • hashlib for the checksums;
  • sqlite3 for databasing the results;
  • os for directory mapping.
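
For concreteness, here is a rough sketch of what I have in mind: chunked reads so large raw-data files don't have to fit in memory, and flagging (not deleting) anything that shares a hash. The root directory and database path are placeholders:

    import hashlib
    import os
    import sqlite3

    ROOT = "/data/lab_records"  # placeholder, not the real root
    DB_PATH = "checksums.db"    # placeholder output database

    def md5_of_file(path, chunk_size=1024 * 1024):
        # Hash in 1 MB chunks so multi-GB files don't exhaust memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                h.update(chunk)
                chunk = f.read(chunk_size)
        return h.hexdigest()

    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, md5 TEXT)")

    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                digest = md5_of_file(full_path)
            except (IOError, OSError):
                continue  # unreadable file: skip it rather than abort the scan
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                         (full_path, digest))
    conn.commit()

    # Flag duplicates: any hash that maps to more than one path.
    for digest, count in conn.execute(
            "SELECT md5, COUNT(*) FROM files GROUP BY md5 HAVING COUNT(*) > 1"):
        print("%s appears %d times" % (digest, count))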
Jonline
  • Yeah, I'm going to infer from the existence of that question and the variety of answers that it must, indeed, be safe to do this. I'll second closing this, thanks. – Jonline Jun 03 '14 at 18:56
    You may also find this discussion of MD5 collisions relevant: http://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value . – user3499545 Jun 03 '14 at 18:57
    If you want to be extra paranoid, you can use a better hash, such as SHA-512 (also available in hashlib). Collisions are even more astronomically unlikely, and it's infeasible to produce a collision even if one wanted to (that's not true of MD5). –  Jun 03 '14 at 18:57
    Out of curiosity, why would you use MD5 rather than SHAx? – Spencer Ruport Jun 03 '14 at 18:57
    Yes, a 128-bit hash space is large enough that a collision with fewer than a million files is astronomically unlikely. If you're worried, you can move to a larger hash space (SHA-512 has been mentioned), and go back and test the probable duplicates for actual equality. – Sneftel Jun 03 '14 at 18:59
  • @SpencerRuport I'd be lying if I said I had a reason beyond habit; I wrote this little script a year ago to do this with my music files (of which there were dramatically fewer and none of which were truly irreplaceable) and was hoping to reuse it, basically. If SHAx doesn't dramatically ramp up the processing time, I may indeed take the safer route, even if it is overkill. – Jonline Jun 04 '14 at 21:11

2 Answers


The probability of two files accidentally sharing the same MD5 hash is 1 in 2^128, i.e.:

0.000000000000000000000000000000000000002938735877055718769921841343055614194546663891

For comparison, the probability of being hit by a 15 km asteroid is 0.00000002. I'd say yes.
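
That figure is for one pair of files. As a comment below notes, roughly 100,000 files form almost 5 billion distinct pairs, so a quick back-of-the-envelope check (plain Python; the only assumption is the file count from the question) is:

    # Birthday-bound estimate: per-pair probability times the number of pairs.
    n = 100000
    pairs = n * (n - 1) // 2               # ~5 * 10**9 pairs
    p_any_collision = pairs / float(2 ** 128)
    print(p_any_collision)                 # ~1.5e-29, still astronomically small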

Backing up the files and testing the script thoroughly remain good advice; human mistakes and bugs are far more likely to happen.

Community
  • But there are almost 5 billion ways to choose pairs of 100,000 items. – David Ehrmann Jun 03 '14 at 20:59
  • Best answer to a Stack question I've ever received, haha. Don't suppose you have a comment on how much of a file is necessary to use as an identifying signature? My instincts say "all of it if you want to be sure", but reason tells me that hashing every byte on several hard drives is pretty intensive. – Jonline Jun 04 '14 at 20:47
  • @Jonline I believe you want to hash *part* of the file? This is dangerous, since many binary formats of the same type share a common header; it's very possible to have two different files with similar headers and thus the same partial signature. –  Jun 04 '14 at 21:14
  • I'd seen comments elsewhere on StackOverflow, while researching this, suggesting (in the context of media files like video) that hashing a few KB of a file would be enough to reach a reasonable degree of certainty. Was just curious if you had an opinion on the matter! – Jonline Jun 04 '14 at 21:27
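
A minimal sketch of the safer version of the partial-hash idea discussed in these comments: use file size plus a hash of a small prefix only as a cheap prefilter, then confirm surviving candidates with a full-file hash. The 64 KB prefix size and the function name are illustrative choices, not from the thread:

    import hashlib
    import os

    def quick_signature(path, prefix_size=64 * 1024):
        # Cheap prefilter: file size plus an MD5 of the first 64 KB.
        # NOT proof of equality: files of the same format often share headers.
        h = hashlib.md5()
        with open(path, "rb") as f:
            h.update(f.read(prefix_size))
        return (os.path.getsize(path), h.hexdigest())

    # Only files whose quick signatures collide need the expensive full-file
    # hash; everything else already differs in size or in its first 64 KB.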

The recent research on MD5 collisions may have worried you: in 2013, algorithms were published that generate MD5 collisions in about a second on an ordinary computer. However, I assure you that this does not nullify the use of MD5 for checking file integrity and finding duplicates. It is highly unlikely that you'll get two normal-use files with the same hash, unless of course you're deliberately looking for trouble and plant binary files engineered to collide. If you're still paranoid, then I advise you to use a hash function with a larger output space, such as SHA-512.
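
Switching is a one-line change, since hashlib exposes the same streaming interface for every algorithm; and as extra insurance you can byte-compare flagged pairs before deleting anything. A minimal sketch (the helper names are illustrative):

    import filecmp
    import hashlib

    def sha512_of_file(path, chunk_size=1024 * 1024):
        # Same structure as an MD5 version: only the constructor changes.
        h = hashlib.sha512()
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                h.update(chunk)
                chunk = f.read(chunk_size)
        return h.hexdigest()

    def definitely_identical(path_a, path_b):
        # Final check before deletion: byte-for-byte content comparison.
        # shallow=False makes filecmp compare contents, not just stat metadata.
        return filecmp.cmp(path_a, path_b, shallow=False)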

Abdul Fatir