I'd like to find data-deduplication algorithms, mostly to find duplicate files. It looks like the first step is to identify files with the same timestamps, sizes, and file names. Then I can compute an MD5 checksum on those candidates and compare, and beyond that compare the files' contents directly. What else should I watch for?
3 Answers
You have OS metadata (size and timestamps). Other metadata includes permissions. You could compare inode information too, but that doesn't tell you much about the contents.
You have a summary (checksum).
You have byte-by-byte details.
What else could there be? Are you asking for other summaries? A summary is less informative than the byte-by-byte details, but you could easily invent lots of other summaries. A summary is only useful if you save it somewhere so you don't recompute it all the time.
If you want to save summaries for the "master" copy, you can invent any kind of summary you want. Line counts, letter "e" counts, average line length: anything is a potentially interesting summary.
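The staged approach described above (cheap metadata comparison first, checksums only on the surviving candidates) can be sketched roughly like this; the chunk size and function name are illustrative assumptions, not anything from a particular tool:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Return groups of paths whose contents share an MD5 digest.

    Stage 1 groups by file size (free metadata from the OS); stage 2
    checksums only groups with more than one member.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for group in by_size.values():
        if len(group) < 2:
            continue  # a unique size can't have a duplicate
        for path in group:
            digest = hashlib.md5()
            with open(path, "rb") as f:
                # Stream in chunks so large files don't need to fit in memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)

    return [g for g in by_hash.values() if len(g) > 1]
```

For a definitive answer you would still byte-compare the matching pairs, since MD5 digests can (very rarely) collide.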

MD5 has collision problems: two files with the same MD5 digest may still have different contents.
If you perform a SHA-1 hash on each file and compare the hashes, only files with the exact same content will have the same hash, barring an astronomically unlikely collision.
This also helps by ignoring whether they have different names, modification dates, etc.
Some people go the extra mile and use SHA-256, but it is really unnecessary. Most commercial deduplication appliances rely on SHA-1 (also referred to as SHA-160).
If you use SHA-1 to compare the files, you don't need anything else.
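A small illustration of the hash-and-compare idea this answer describes, using Python's standard `hashlib`; the function names are placeholders of my own:

```python
import hashlib

def sha1_of(path, chunk_size=65536):
    """Stream a file through SHA-1; names, paths, and timestamps play no part."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    # Equal digests mean equal contents for all practical purposes,
    # since only the bytes of the file feed the hash.
    return sha1_of(path_a) == sha1_of(path_b)
```

Because the hash depends only on the file's bytes, two copies with different names and modification dates still compare equal.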
I know this because I have worked with different deduplication systems and vendors for a number of years and I have also written a sample deduplication system.

That's not exactly correct: SHA-1, giving only 2^160 possible hashes, certainly makes it *highly unlikely* to suffer a collision, but not actually impossible. Still, it's generally the case that just comparing hashes will get you good results rather faster than byte-by-byte, so +0 overall. – Nathan Tuggy Jan 04 '15 at 07:38
There are products available for this. Look for Duplicate File Detective. It can match by name, timestamp, MD5, and other criteria.
