I've been tasked with consolidating about 15 years of records from a laboratory, most of which are either student work or raw data. We're talking 100,000+ human-generated files.
My plan is to write a Python 2.7 script that will map the entire directory structure, create a checksum for each file, and then flag duplicates for deletion. I'm expecting roughly 10-25% duplicates.
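Concretely, the checksum step would look something like this (just a sketch; the name `md5_file` and the 64 KiB chunk size are placeholders, and I read in chunks so large raw-data files aren't pulled into memory whole):

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # placeholder block size for chunked reads


def md5_file(path):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(CHUNK_SIZE)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```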
My understanding is that MD5 collisions are theoretically possible, but so unlikely that this is essentially a safe procedure (to set the bar: even if one collision happened, my job would still be safe).
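For context, my back-of-envelope reasoning, assuming nobody is deliberately crafting colliding files (the adversarial case is where MD5 is actually broken): with $n$ files and a $b$-bit hash, the birthday bound puts the chance of any accidental collision at roughly

$$p \approx \frac{n^2}{2^{b+1}} = \frac{(10^5)^2}{2^{129}} \approx 1.5 \times 10^{-29}.$$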
Is this a safe assumption? In case implementation matters, the only Python libraries I intend to use are:
- `hashlib` for the checksums;
- `sqlite3` for databasing the results;
- `os` for directory mapping.
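Putting those together, this is roughly the shape of the script I have in mind. It's only a sketch: the `files` table, `lab_index.db`, and the root path are placeholders, and nothing gets deleted, duplicates are just flagged for review.

```python
import hashlib
import os
import sqlite3

CHUNK_SIZE = 64 * 1024  # placeholder block size for chunked reads


def md5_file(path):
    """Hex MD5 digest of a single file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b''):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root, db_path):
    """Walk `root` and record (path, size, md5) for every file in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "  path TEXT PRIMARY KEY,"
        "  size INTEGER,"
        "  md5  TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                conn.execute(
                    "INSERT OR REPLACE INTO files (path, size, md5) "
                    "VALUES (?, ?, ?)",
                    (full, os.path.getsize(full), md5_file(full)),
                )
            except (IOError, OSError):
                pass  # unreadable file: skipped here, but a real run should log it
    conn.commit()
    return conn


def duplicate_groups(conn):
    """Yield (md5, [paths...]) for every digest that appears more than once."""
    dup_digests = conn.execute(
        "SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1"
    ).fetchall()
    for (digest,) in dup_digests:
        paths = [row[0] for row in conn.execute(
            "SELECT path FROM files WHERE md5 = ?", (digest,))]
        yield digest, paths


if __name__ == '__main__':
    # Placeholder paths; duplicates are only reported, never deleted.
    conn = index_tree('/path/to/lab/records', 'lab_index.db')
    for digest, paths in duplicate_groups(conn):
        print('%s: %s' % (digest, ', '.join(paths)))
```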