It would depend on the files you're comparing.
A) The worst-case scenario is:
- You have a lot of files which are the same size
- The files are very large
- The files are very similar, with the differences confined to a small region at a random location in the file
For example, if you had:
- 100 files of 2MB each (so all the same size),
- each compared with every other,
- using direct binary comparison, with
- 50% of each file read on average (the first unequal byte is equally likely to be anywhere in the file)
Then you would have:
- 4,950 pairwise comparisons (100 × 99 / 2),
- each reading about 1MB from each of the two files, which equals
- a total of roughly 10GB of reading (the comparison itself is sketched below).
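For concreteness, the direct binary comparison assumed here could look something like the following Python sketch; the chunk size and function name are illustrative choices, not part of any particular tool:

```python
def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files byte-for-byte, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False   # early exit: ~50% of each file read on average in the scenario above
            if not chunk_a:    # both files exhausted at the same point: identical
                return True
```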
However, if you had the same scenario but derived the hashes of the files first, you would:
- read 200MB of data from disk (typically the slowest component in a computer), distilling it to
- 1.6KB in memory (using MD5 hashing - 16 bytes per file - security is not important here)
- and would then read 2N × 2MB for the final direct binary comparison, where N is the number of duplicate pairs found (a sketch of this hash-first pass follows).
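A rough sketch of that hash-first pass, assuming Python's standard hashlib and reusing the files_equal() helper sketched above:

```python
import hashlib
from collections import defaultdict

def find_duplicates_by_hash(paths):
    """Hash every file once, then binary-compare only files whose digests match."""
    by_digest = defaultdict(list)
    for path in paths:
        md5 = hashlib.md5()
        with open(path, "rb") as f:                        # one full read per file
            for chunk in iter(lambda: f.read(64 * 1024), b""):
                md5.update(chunk)
        by_digest[md5.digest()].append(path)               # only 16 bytes kept per file

    duplicates = []
    for candidates in by_digest.values():
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                if files_equal(a, b):                      # final confirmation, rules out hash collisions
                    duplicates.append((a, b))
    return duplicates
```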
I think this worst-case scenario is not typical though.
B) Typical case scenario is:
- Files are usually different in size
- Non-duplicate files are highly likely to differ near the start, so direct binary comparison rarely needs to read the whole of two differing files of the same size.
For example, if you had:
- A folder of MP3 files (they don't get too big - maybe no bigger than 5MB)
- 100 files
- checking size first
- at most 3 files the same size (duplicates or not)
- using binary comparison for files of the same size
- 99% likely to differ within the first 1KB
Then you would have:
- At most 33 groups of 3 same-size files to compare
- Binary reading of the 3 files in each group (more is possible) concurrently in 4K chunks, as sketched after this example
- With 0% duplicates found: 33 × 3 × 4K of reading = 396KB of disk reading
- With 100% duplicates found: 33 × 3 × N, where N is the file size (~5MB) = ~495MB of reading
If you expect 100% duplicates, hashing is no more efficient than direct binary comparison; since in practice you should expect fewer than 100% duplicates, hashing will be less efficient than direct binary comparison.
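A sketch of this size-then-binary approach, grouping by size from file metadata and then reading same-size files side by side in 4K chunks until they diverge; the function name, chunk size, and grouping strategy are illustrative assumptions:

```python
import os
from collections import defaultdict

CHUNK = 4 * 1024  # 4K chunks, as in the example above

def duplicate_groups(paths):
    """Group files by size, then split same-size groups chunk by chunk."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)   # size check costs no file reads

    results = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                                  # unique size: excluded without reading
        pending = [[(p, open(p, "rb")) for p in same_size]]
        while pending:
            group = pending.pop()
            buckets = defaultdict(list)
            for path, handle in group:
                buckets[handle.read(CHUNK)].append((path, handle))
            for chunk, members in buckets.items():
                if len(members) < 2:
                    members[0][1].close()             # diverged: no longer a duplicate candidate
                elif chunk == b"":                    # reached EOF still together: duplicates
                    results.append([p for p, h in members])
                    for _, h in members:
                        h.close()
                else:
                    pending.append(members)           # still identical so far: keep reading
    return results
```

On the 100-file MP3 folder above, this would read somewhere between the ~396KB and ~495MB estimated, depending on how many duplicates actually exist.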
C) Repeated comparison
This is the exception: building a hash+length+path database for all files will accelerate repeated comparisons, though the benefit may be marginal. It requires reading 100% of every file initially and storing the hash database. Each new file must then be read 100% and added to the database, and if its hash matches an existing entry it will still need direct binary comparison as the final step (to rule out a hash collision). Even if most files differ in size, a new file created in the target folder may happen to match an existing file's size; in that case the stored hash lets it be quickly excluded from direct comparison (a rough sketch of such an index follows).
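A rough sketch of such a length+hash+path index, assuming a plain JSON file as the database; the file name, key format, and function names are illustrative, not a standard API:

```python
import hashlib
import json
import os

INDEX_FILE = "file_index.json"   # assumed location of the persistent index

def md5_of(path, chunk_size=64 * 1024):
    md5 = hashlib.md5()
    with open(path, "rb") as f:                        # one full read of the file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def register_new_file(path):
    """Add a new file to the index and return existing paths with the same
    size and hash; the caller still confirms with a direct binary comparison."""
    index = {}
    if os.path.exists(INDEX_FILE):
        with open(INDEX_FILE) as f:
            index = json.load(f)

    key = f"{os.path.getsize(path)}:{md5_of(path)}"    # length + hash as the lookup key
    candidates = index.get(key, [])
    index.setdefault(key, []).append(path)

    with open(INDEX_FILE, "w") as f:
        json.dump(index, f)
    return candidates
```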
To conclude:
- No additional hashing should be used (direct binary comparison, the ultimate test, should always be the final step)
- Binary comparison is often more efficient on a first run, when many files have different sizes
- Comparing MP3 files works well with a length check followed by binary comparison.