There is a hard disk with lots of files; how would you find the duplicate files among them?
The first thing we could do is group the files by FILE_SIZE, since files of different sizes can never be duplicates.
Then, within each size group, we could compute a hash of each file using an algorithm like MD5; files with the same size and hash are very likely duplicates (a byte-by-byte comparison can confirm, since hash collisions are theoretically possible).
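The two-pass idea above (bucket by size, then hash only the files whose size occurs more than once) can be sketched in Python as follows. The function name `find_duplicates` and the chunked-reading helper are my own; this is a minimal sketch, not a full deduplication tool:

```python
import hashlib
import os
from collections import defaultdict

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return groups of paths under root that share both size and MD5 hash."""
    # Pass 1: group by file size -- different sizes can never be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files whose size occurs more than once.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size => cannot have a duplicate
        for path in paths:
            by_hash[(size, md5_of_file(path))].append(path)

    # Keep only the buckets that actually contain duplicates.
    return [group for group in by_hash.values() if len(group) > 1]
```

The size pre-filter matters in practice: hashing is I/O-bound, and on a typical disk most files have a unique size, so pass 2 touches only a small fraction of the data.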
Can anyone suggest other approaches to narrow down the candidates for duplicate files, apart from using FILE_SIZE? Maybe using file headers, extensions, or any other idea?