
There is a hard disk with lots of files; how would you find duplicate files among them?
The first thing we could do is separate files on the basis of FILE_SIZE.
Then we could compute a hash value of each file using an algorithm like MD5; files with the same hash would be duplicates.
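
A rough sketch of this two-stage idea (assuming Python with `os` and `hashlib`; the directory argument and chunk size are just placeholders):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    # Stage 1: group files by size; only same-size files can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable file, skip it

    # Stage 2: within each size group, hash the full contents with MD5.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            by_hash[(size, md5.hexdigest())].append(path)

    # Groups of two or more paths share both size and MD5, so report them as duplicates.
    return [group for group in by_hash.values() if len(group) > 1]
```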

Can anyone suggest other approaches to segregate candidates for duplicate files, apart from using FILE_SIZE? Maybe using file headers, extensions, or some other idea?

user2328404

1 Answer


You may want to use multiple levels of comparisons, with the fast ones coming first to avoid running the slower ones more than necessary; a rough code sketch follows the list. Suggestions:

  1. Compare the file lengths.

  2. Then compare the first 1K bytes of the files.

  3. Then compare the last 1K bytes of the files. (The first and last parts of a file are more likely to contain signatures, internal checksums, modification data, etc., that will change.)

  4. Compare the CRC32 checksums of the files. Use CRC rather than a cryptographic hash, unless you have security concerns. CRC will be much faster.
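
A minimal sketch of that staged pipeline, assuming Python; `zlib.crc32` stands in for the CRC step, and the 1K probe size and helper names are just illustrative:

```python
import os
import zlib
from collections import defaultdict

PROBE = 1024  # bytes compared at each end of the file

def probe(path, size):
    """Cheap signature: first and last 1K bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(PROBE)
        f.seek(max(size - PROBE, 0))
        tail = f.read(PROBE)
    return head, tail

def crc32_of(path):
    """Full-content CRC32, computed only for candidates that survive the cheap checks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def duplicate_groups(paths):
    # Level 1: file length.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue
        # Levels 2-3: first and last 1K bytes.
        by_probe = defaultdict(list)
        for p in same_size:
            by_probe[probe(p, size)].append(p)
        for candidates in by_probe.values():
            if len(candidates) < 2:
                continue
            # Level 4: CRC32 of the whole file.
            by_crc = defaultdict(list)
            for p in candidates:
                by_crc[crc32_of(p)].append(p)
            groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```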

B-Con