6

I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of the duplicates will be preserved and the remaining duplicates will be deleted to save disk space.

By duplicates I mean files that have the same content but may differ in file name and path.

I was thinking about using hash algorithms for this purpose, but there is a chance that different files have the same hash, so I need some additional mechanism to tell me that the files aren't the same even though the hashes match, because I don't want to delete two different files.

Which additional fast and reliable mechanism would you use?

xralf
  • 3,312
  • 45
  • 129
  • 200
  • 1
    The chance of a hash collision is seriously slim. If you want 100% certainty beyond that, you can just compare the full file contents - it'll be rare enough that performance doesn't matter. –  Mar 21 '12 at 16:02
  • 1
    @delnan: This is not true. The chances of a collision for a specific file are low; for a large collection of files the chances are much higher - see the [birthday paradox](http://en.wikipedia.org/wiki/Birthday_problem) as an example. The probability that two people out of 23 share a birthday is about 50%. The chances of a collision grow rapidly as the collection gets bigger. – amit Mar 21 '12 at 16:04
  • 1
    @amit I'm aware of the birthday paradox, it's also why I don't say "the odds are so low you shouldn't bother checking". Also, my gut feeling says the chances are so low for two files that it would take hundreds or thousands of files to have collision odds >1%. But yeah, I'd better check that first. The table in that article (regarding the [Birthday attack](http://en.wikipedia.org/wiki/Birthday_attack)) seems to confirm this. If I'm reading it right, a perfect 64-bit hash requires `1.9 × 10^8` (= 190 million) files even for a **0.1%** collision chance. –  Mar 21 '12 at 16:15

7 Answers

22

Calculating a hash for every file will make your program slow. It's better to also check the file size first: all duplicate files must have the same size. Only when two files share the same size should you apply the hash check. That will make your program perform fast.

You can add more steps:

  1. Check if the file sizes are equal
  2. If step 1 passes, check whether the first and last range of bytes (say 100 bytes) are equal
  3. If step 2 passes, check the file type
  4. If step 3 passes, finally check the hash

The more criteria you add, the faster it will perform, and this way you can often avoid the last resort (the hash) entirely.
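
A minimal sketch of this pipeline in C# (the class and method names and the 100-byte sample size are illustrative, not part of the answer; step 3, the file-type check, is left out):

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class DuplicateCheck
    {
        // Returns true only when the cheap tests pass and the full-content hashes match.
        public static bool LikelyDuplicates(string pathA, string pathB)
        {
            var a = new FileInfo(pathA);
            var b = new FileInfo(pathB);

            // Step 1: sizes must match.
            if (a.Length != b.Length)
                return false;

            // Step 2: the first and last 100 bytes must match.
            if (!SampleEquals(pathA, pathB, a.Length, 100))
                return false;

            // Step 4: full-content hash as the last resort.
            using (var sha = SHA256.Create())
            using (var fsA = File.OpenRead(pathA))
            using (var fsB = File.OpenRead(pathB))
                return sha.ComputeHash(fsA).SequenceEqual(sha.ComputeHash(fsB));
        }

        // Compares the first and last sampleSize bytes of two equally sized files.
        private static bool SampleEquals(string pathA, string pathB, long length, int sampleSize)
        {
            int n = (int)Math.Min(length, sampleSize);
            var bufA = new byte[n];
            var bufB = new byte[n];

            using (var fsA = File.OpenRead(pathA))
            using (var fsB = File.OpenRead(pathB))
            {
                // First bytes (a robust version would loop until n bytes are read).
                fsA.Read(bufA, 0, n);
                fsB.Read(bufB, 0, n);
                if (!bufA.SequenceEqual(bufB))
                    return false;

                // Last bytes.
                fsA.Seek(-n, SeekOrigin.End);
                fsB.Seek(-n, SeekOrigin.End);
                fsA.Read(bufA, 0, n);
                fsB.Read(bufB, 0, n);
                return bufA.SequenceEqual(bufB);
            }
        }
    }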

Shiplu Mokaddim
  • 56,364
  • 17
  • 141
  • 187
  • 1
    You can also compute two hashes (different hashes!) of the file and consider them the same only if both hashes are equal. – Vatine Mar 21 '12 at 16:12
  • @vatine You mean `if(md5("some_file")==sha1("some_file"))` they are the same? – Shiplu Mokaddim Mar 21 '12 at 16:17
  • 4
    No, if `md5("filea") == md5("fileb")` and `sha1("filea") == sha1("fileb")` you have a much stronger guarantee of filea and fileb being identical than if you only use one hash function. – Vatine Mar 21 '12 at 16:20
  • @Vatine in that case I think using a single hash with more bits would do. Say `sha256` or `sha512` – Shiplu Mokaddim Mar 21 '12 at 16:33
  • You can still get a collision (admittedly, you'd need 2^128 or 2^256 files before that hits 50%), but with two different hashes, that's much less likely. – Vatine Mar 22 '12 at 10:31
  • 2
    One thing to note: if the most frequent case is for there to be lots of duplicates (more specifically a high ratio between duplicates and non-duplicates) then calculating the hash only as step 4 isn't going to be that much slower than skipping the first 3 steps. For instance, if importing images from a digital camera where old images are not deleted. – Michael Jul 22 '13 at 22:28
  • 1
    You shouldn't need to hash at all, assuming most files are not similar in the first file block. Hashing might become more important for increasingly large sets of files checked multiple times, however. You are looking to falsify, not confirm - you can stop as soon as a byte at a position differs. Only if there are too many files to cache the first block in memory should you start with a hash of the first block of the file; that should drastically reduce your search set for the final byte-level falsifying. Hashing a whole file is potentially a time-consuming process despite the benefits. – Kind Contributor Mar 28 '16 at 08:19
3

It would depend on the files you're comparing.

A) The worst-case scenario is:

  1. You have a lot of files which are the same size
  2. The files are very large
  3. The files are very similar with differences found in a narrow random location in the file

For example, if you had:

  • 100x 2MB files of the same size,
  • compared with each other,
  • using binary comparison, with
  • an average 50% of each file read (an unequal byte is as likely to appear in the first half of the file as the second)

Then you would have:

  • 10,000 comparisons of
  • 1MB which equals
  • a total of 10GB of reading.

However, if you had the same scenario but derived the hashes of the files first, you would:

  • read 200MB of data from disk (typically the slowest component in a computer), distilling it to
  • 1.6KB in memory (using MD5 hashing - 16 bytes per file - security is not important here)
  • and would read 2N*2MB for the final direct binary comparison, where N is the number of duplicates found.

I think this worst-case scenario is not typical though.

B) Typical case scenario is:

  1. Files are usually different in size
  2. The files are highly likely to differ near the start of the file - this means direct binary comparison does not typically involve reading the whole file on the bulk of differing files of the same size.

For example, if you had:

  • A folder of MP3 files (they don't get too big - maybe no bigger than 5MB)
  • 100 files
  • checking size first
  • at most 3 files the same size (duplicates or not)
  • using binary comparison for files of the same size
  • 99% likely to be different after 1KBytes

Then you would have:

  • At most 33 cases where the length is the same in 3 file sets
  • Parallel binary reading of 3 files (or more is possible) concurrently in 4K chunks
  • With 0% duplicates found - 33 * 3 * 4K of reading files = 396KB of disk reading
  • With 100% duplicates found = 33 * 3 * N, where N is the file size (~5MB) = ~495MB

If you expect 100% duplicates, hashing won't be any more efficient than direct binary comparison. Given that you should expect <100% duplicates, hashing would be less efficient than direct binary comparison.

C) Repeated comparison

This is the exception. Building a hash+length+path database for all files will accelerate repeated comparisons, but the benefits would be marginal. It requires reading 100% of the files initially and storing the hash database. A new file will need to be read 100% and then added to the database, and if it matches it will still require a direct binary comparison as a final step (to rule out a hash collision). Even if most files are different sizes, a new file created in the target folder may happen to match an existing file's size; with the hash database it can still be quickly excluded from direct comparison.

To conclude:

  • No additional hashes should be used (the ultimate test - binary comparison - should always be the final test)
  • Binary comparison is often more efficient on first run when there are many different sized files
  • MP3 comparison works well with length then binary comparison.
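
As a rough sketch of that size-check-then-binary-comparison approach in C# (the names and the 4K chunk size are illustrative, and the chunked read is simplified):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class BinaryDedup
    {
        // Groups files by size, then confirms each candidate pair with a chunked byte-for-byte comparison.
        public static IEnumerable<Tuple<string, string>> FindDuplicates(IEnumerable<string> paths)
        {
            foreach (var group in paths.GroupBy(p => new FileInfo(p).Length))
            {
                var files = group.ToList();
                for (int i = 0; i < files.Count; i++)
                    for (int j = i + 1; j < files.Count; j++)
                        if (StreamsEqual(files[i], files[j]))
                            yield return Tuple.Create(files[i], files[j]);
            }
        }

        // Reads both (equally sized) files in 4K chunks and stops at the first difference.
        private static bool StreamsEqual(string pathA, string pathB)
        {
            const int chunkSize = 4096;
            var bufA = new byte[chunkSize];
            var bufB = new byte[chunkSize];

            using (var fsA = File.OpenRead(pathA))
            using (var fsB = File.OpenRead(pathB))
            {
                int readA;
                while ((readA = fsA.Read(bufA, 0, chunkSize)) > 0)
                {
                    int readB = fsB.Read(bufB, 0, chunkSize);
                    if (readA != readB || !bufA.Take(readA).SequenceEqual(bufB.Take(readB)))
                        return false;
                }
                return true;
            }
        }
    }

In the MP3 scenario above, most candidate pairs fall out after the very first 4K chunk, which is where the 396KB figure comes from.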
Kind Contributor
  • 17,547
  • 6
  • 53
  • 70
1

The hash solution is fine - you will just need one of the standard collision-resolution mechanisms for dealing with two elements that hash to the same value (chaining or open addressing).

Just add elements iteratively - if your implementation detects that there is a dupe, it will not add it to the hash set. You will know that an element is a dupe if the size of the set did not change after trying to add the element.

Most likely there is already an implementation of this kind of data structure in your language - for example HashSet in Java and unordered_set in C++.
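
A minimal sketch of that idea in C#, using the content hash as the set key (the class and method names are illustrative); `HashSet<T>.Add` returns `false` when an equal element is already present, which is the "size did not change" signal described above:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    static class HashSetDedup
    {
        // Returns the paths whose content hash was already seen, i.e. the probable duplicates.
        public static List<string> FindDuplicates(IEnumerable<string> paths)
        {
            var seen = new HashSet<string>();
            var dupes = new List<string>();

            using (var sha = SHA256.Create())
            {
                foreach (var path in paths)
                {
                    string key;
                    using (var fs = File.OpenRead(path))
                        key = Convert.ToBase64String(sha.ComputeHash(fs));

                    // Add returns false if this content hash is already in the set.
                    if (!seen.Add(key))
                        dupes.Add(path);
                }
            }
            return dupes;
        }
    }

To address the question's worry about collisions, a final byte-for-byte comparison against the first file that produced the same hash would rule out the (very unlikely) false positive before anything is deleted.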

amit
  • 175,853
  • 27
  • 231
  • 333
1

If you use a hash algorithm like SHA-1, or better yet SHA-256 or higher, I really doubt you will get the same hash value for two different files. SHA is a family of cryptographic hash functions and is used in version control systems like Git. So you can rest assured that it will do the job for you.

But if you still want additional checks in place, you can follow these two steps.

  1. Parse the headers - this is a really tough cookie to crack, since different formats may have different header lengths
  2. Add some sanity checks - compare the file sizes, read a few random file positions and check whether the bytes there are the same
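
For the main suggestion, a short sketch of hashing a file with SHA-256 in C# (the class and method names are illustrative):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class FileHasher
    {
        // Computes the SHA-256 digest of a file and returns it as a lowercase hex string.
        public static string Sha256Hex(string path)
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(path))
            {
                byte[] digest = sha.ComputeHash(stream);
                return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
            }
        }
    }

Two files can then be treated as duplicates when `Sha256Hex(a) == Sha256Hex(b)`, optionally followed by the sanity checks above.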

Neo
  • 1,554
  • 2
  • 15
  • 28
1

This is the typical output of md5sum:

0c9990e3d02f33d1ea2d63afb3f17c71

If you don't have to fear intentionally faked files, the chance for a second, random file to match is

1/(decimal(0xffffffffffffffffffffffffffffffff)+1)

If you take the file size into account as an additional test, your certainty that both files match increases. You might add more and more measurements, but a bitwise comparison will be the last word in such a debate. For practical purposes, md5sum should be enough.
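
To put a rough number on "should be enough", here is a small sketch (in C#, names illustrative) of the usual birthday-bound approximation p ≈ k(k-1)/2 / 2^128 for a collection of k files and an ideal 128-bit hash:

    using System;

    static class CollisionEstimate
    {
        // Approximate probability that at least two of k random files share a 128-bit hash value.
        public static double Probability(double k)
        {
            double pairs = k * (k - 1) / 2.0;       // number of file pairs
            double space = Math.Pow(2, 128);        // ~3.4e38 possible hash values
            return pairs / space;
        }
    }

For 100,000 files this gives roughly 1.5e-29; even a billion files stay around 1.5e-21, far below the chance of a bug in the size check itself.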

user unknown
  • 35,537
  • 11
  • 75
  • 121
  • This is not true. The chances of a collision for a specific file are low [`1/(decimal(0xffffffffffffffffffffffffffffffff)+1)`]; for a large collection of files the chances are much higher - see the [birthday paradox](http://en.wikipedia.org/wiki/Birthday_problem) as an example. The probability that two people out of 23 share a birthday is about 50%. The chances of a collision grow rapidly as the collection gets bigger. – amit Mar 21 '12 at 16:03
  • `the chance for a second, random file to match` means exactly that: One file, and a second file. How many files do you have? How big is the chance of a collision for 10 000 files? 100 000 files? – user unknown Mar 21 '12 at 16:45
  • I agree with that, but I doubt that the claim `For practical purposes, md5sum should be enough.` is true - especially if we are talking about a distributed [and huge] file system... It *could* be true - but this claim should be proven. – amit Mar 21 '12 at 16:48
  • @amit: There are about 50M possible pairs of 10k files, and about 5G pairs from 100k files. Divide those numbers by `340 282 366 920 938 463 463 374 607 431 768 211 456`, which is 16^32, the size of the md5sum spectrum, and compare the probability of a collision with the probability of getting the file size check wrong, or of any other bug in the checking. – user unknown Mar 21 '12 at 17:30
0
    // Requires: using System.Collections.Generic; using System.IO;
    // Assumes duplicatePathsOne, duplicateNamesOne, duplicatePathsTwo and duplicateNamesTwo
    // are List<string> fields on the containing class.

    /// ------------------------------------------------------------------------------------------------------------------------
    /// <summary>
    /// Writes duplicate files to a List<String>
    /// </summary>
    private void CompareDirectory(string[] files)
    {
        for (int i = 0; i < files.Length; i++)
        {
            FileInfo one = new FileInfo(files[i]); // Here's a spot for a progressbar or something

            for (int i2 = 0; i2 < files.Length; i2++)
            {
                if (i != i2 && !duplicatePathsOne.Contains(files[i2])) // In order to prevent duplicate entries
                {
                    FileInfo two = new FileInfo(files[i2]);
                    if (FilesAreEqual_OneByte(one, two))
                    {
                        duplicatePathsOne.Add(files[i]);
                        duplicateNamesOne.Add(Path.GetFileName(files[i]));
                        duplicatePathsTwo.Add(files[i2]);
                        duplicateNamesTwo.Add(Path.GetFileName(files[i2]));
                    }
                }
            }
        }
    }

    /// ------------------------------------------------------------------------------------------------------------------------
    /// <summary>
    /// Compares two files byte by byte, after a quick length check
    /// </summary>
    private static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            // Use a long counter so files larger than 2GB don't overflow the loop variable.
            for (long i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }
0

Here is a VBS script that will generate a CSV file listing the duplicate files in a folder, based on MD5 file checksum and file size.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oMD5:  Set oMD5 = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)
    oLog.Close
    Msgbox "Done!"
Else
    Msgbox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        'sKey = oFile.Name & " - " & oFile.Size
        sKey = GetMd5(oFile.Path) & " - " & oFile.Size

        If dic.Exists(sKey) = False Then 
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function

Function GetMd5(filename)
    Dim oXml, oElement

    oMD5.ComputeHash_2(GetBinaryFile(filename))

    Set oXml = CreateObject("MSXML2.DOMDocument")
    Set oElement = oXml.CreateElement("tmp")
    oElement.DataType = "bin.hex"
    oElement.NodeTypedValue = oMD5.Hash
    GetMd5 = oElement.Text
End Function

Function GetBinaryFile(filename)
    Dim oStream: Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = 1 'adTypeBinary
    oStream.Open
    oStream.LoadFromFile filename
    GetBinaryFile= oStream.Read
    oStream.Close
    Set oStream = Nothing
End Function

Here is a VBS script that will generate a CSV file listing the duplicate files in a folder, based on file name and size.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)
    oLog.Close
    Msgbox "Done!"
Else
    Msgbox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        sKey = oFile.Name & " - " & oFile.Size

        If dic.Exists(sKey) = False Then 
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function
Igor Krupitsky
  • 787
  • 6
  • 9