Fastest algorithm to detect duplicate files

Question

While searching for duplicates among my 2 terabytes of images stored on an HDD, I was astonished by the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup, which is implemented as a very well written and documented shell script using an ultra-long pipe. Essentially it is based on find and hashing (md5 and SHA1). The author states that it is faster than any other alternative, which I couldn't believe. Then I found Detecting duplicate files, where the discussion quickly slid towards hashing and comparing hashes, which in my opinion is not the best and fastest way.

So the usual algorithm seems to work like this:

generate a sorted list of all files (path, size, id)
group files with the exact same size
calculate the hash of all the files with the same size and compare the hashes
the same hash means identical files - a duplicate is found

Sometimes the speed is increased by first using a faster hash algorithm (like md5) with a higher collision probability and then, if the hashes match, using a second, slower but less collision-prone algorithm to confirm the duplicates. Another improvement is to first hash only a small chunk to weed out totally different files.
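The scheme described above can be sketched roughly as follows. This is only an illustration, not how findup is implemented: the function names, the 4096-byte head size and the use of SHA-1 for both passes are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, limit=None, chunk_size=1 << 20):
    """SHA-1 of the first `limit` bytes of a file (whole file if limit is None)."""
    h = hashlib.sha1()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(want)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.digest()


def split_groups(paths, key):
    """Partition paths by key(path); keep only groups with at least 2 members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(root, head_bytes=4096):
    """Return lists of paths whose size, head hash and full hash all match."""
    all_files = [os.path.join(d, name)
                 for d, _, names in os.walk(root) for name in names
                 if os.path.isfile(os.path.join(d, name))]
    result = []
    # Step 1: same size. Step 2: same hash of the first chunk. Step 3: same full hash.
    for same_size in split_groups(all_files, os.path.getsize):
        for same_head in split_groups(same_size, lambda p: file_digest(p, head_bytes)):
            result.extend(split_groups(same_head, file_digest))
    return result
```

Note that a file surviving to step 3 has its first chunk read twice, which is the repeated-read cost the question complains about.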

So I've got the opinion that this scheme is broken in several respects:

duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead of just comparing the files byte by byte, we introduce a (low) probability of a false positive (two different files with the same hash)
a hash calculation is a lot slower than a plain byte-by-byte compare

I found one (Windows) app which claims to be fast by not using this common hashing scheme.

Am I totally wrong with my ideas and opinion?

[Update]

There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception stemming from the general notion that "hash tables speed things up". To generate the hash of a file for the first time, the file needs to be read fully, byte by byte. So on the one hand there is a byte-by-byte compare, which only reads each duplicate candidate up to the first differing position. On the other hand there is the hash function, which generates an ID out of so-and-so many bytes - let's say the first 10k bytes of a terabyte, or the full terabyte if the first 10k are the same. So, assuming that I don't usually have a precalculated and automatically updated table of all file hashes, I need to calculate the hash and read every byte of the duplicate candidates. A byte-by-byte compare doesn't need to do this.

[Update 2]

I've got a first answer which again goes in the direction of "hashes are generally a good idea" and, out of that (not so wrong) thinking, tries to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question. "Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare it in memory? If I have to compare 100 files, I open 100 file handles, read a block from each in parallel and then do the comparison in memory. This seems to be a lot faster than updating one or more complicated, slow hash algorithms with these 100 files.
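A minimal sketch of this "read one block from every candidate and compare in memory" idea is shown below. The function name and block size are assumptions for illustration. Instead of comparing the blocks pair-wise, it partitions the candidates with a dictionary keyed by the raw block; the dictionary uses Python's cheap in-memory hash internally but confirms every match with a full equality check, so the result is exact.

```python
from collections import defaultdict


def refine_by_content(paths, block_size=1 << 16):
    """Split same-size files into groups of identical content.

    Reads one block from every open candidate, partitions the candidates by
    that block, and drops any file that ends up alone in its partition.
    """
    files = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]      # groups that are identical so far
        duplicates = []
        while pending:
            group = pending.pop()
            by_block = defaultdict(list)
            for p in group:
                by_block[files[p].read(block_size)].append(p)
            for block, members in by_block.items():
                if len(members) < 2:
                    continue                    # unique so far: stop reading it
                if block == b"":
                    duplicates.append(members)  # reached end of file together
                else:
                    pending.append(members)
        return duplicates
    finally:
        for f in files.values():
            f.close()
```

Each file is read at most once and only up to the block where it becomes unique. The sketch keeps every candidate open at the same time, so very large same-size groups could run into the operating system's limit on open file handles, and on a spinning disk the interleaved reads cause seeks unless the block size is large.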

[Update 3]

Given the very big bias in favor of "one should always use hash functions because they are very good!", I read through some SO questions on hash quality, e.g. this: Which hashing algorithm is best for uniqueness and speed? It seems that common hash functions produce collisions more often than we think, thanks to bad design and the birthday paradox. The test set contained: "A list of 216,553 English words (in lowercase), the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from around 100 to nearly 20k collisions. So testing millions of files for (in)equality based only on hashes might not be a good idea at all.

I guess I need to modify 1 and replace the md5/sha1 part of the pipe with "cmp" and just measure times. I'll keep you updated.

[Update 4] Thanks for all the feedback. Slowly we are converging. The background is what I observed when fslint's findup was running on my machine, md5summing hundreds of images. That took quite a while and the HDD was spinning like hell. So I was wondering "what the heck is this crazy tool thinking, destroying my HDD and taking huge amounts of time", when just comparing byte-by-byte 1) is less expensive per byte than any hash or checksum algorithm and 2) lets me return early on the first difference, so I save tons of time by not wasting HDD bandwidth on reading full files and calculating hashes over full files. I still think that's true - but: I guess I didn't catch the point that, while a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper per byte than hashing, complexity-wise hashing with O(n) may win when more and more files need to be compared against each other. I have put this problem on my list and plan to either replace the md5 part of fslint's findup with cmp or enhance Python's filecmp.py compare lib, which only compares 2 files at once, with a multiple-files option and maybe an md5 hash version. So thank you all for the moment. And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelihood of same-length files, duplicate files, typical file size, performance of CPU vs. memory vs. disk, single vs. multicore and so on. And I learned that I should consider using hashes more often - but I'm an embedded developer with, most of the time, very very limited resources ;-)

Thanks for all your effort! Marcel

If you've got only two files that are the same size, then I would agree that a simple comparison is all that's needed. But, if you've got 100 files that are all the same size, then the hash solution is much faster. — user3386109, Nov 15 '18 at 08:34
I'm talking about millions of files and I see no difference in speed whether I try to find only 2 duplicates or millions. A hash function needs to read all bytes from the HDD and in addition does "complicated" calculations on every byte. The result is an ID which needs to be compared afterwards. A byte-by-byte comparison can abort on the first differing byte. — Marcel, Nov 15 '18 at 08:40
Do you really have millions of files that are the exact same size? — user3386109, Nov 15 '18 at 08:48
A hash does not necessarily have to read the *entire* file - it could be adaptive and read just the first 1MB and last 1 MB on multi-GB files for example. — Mark Setchell, Nov 15 '18 at 09:00
If you only have a few files to compare (based on size) it seems that just comparing them without hashing is faster. The problem occurs when you cannot fit all the file contents (files of the same size) in memory. Obviously for GB files you can just read the first 10MB of all of them and compare that, etc. Exactly how to implement that seems to depend on how to schedule read instructions and physical memory, and hashing seems optimized for the case where you don't have much physical memory. — Hans Olsson, Nov 15 '18 at 09:05
All hash algorithms I know read every byte of the file which needs to be hashed. So why should hashing be faster than comparing byte by byte? — Marcel, Nov 15 '18 at 09:23
See my previous comment - *"a hash does not necessarily have to read the entire file"*. — Mark Setchell, Nov 15 '18 at 09:35
@Marcel: Hashing gives you a short value which you can cache to speed up subsequent comparisons. — Richard, Nov 15 '18 at 09:40
The question is: "Fastest algorithm to detect duplicate files". Only hashing first and last chunks of the said file might speed up the hashing based algorithm but does as well for the comparing-based version. I guess that comparing instead of hashing will be faster every time. — Marcel, Nov 15 '18 at 09:41
Well, you could do both: calculate a hashvalue from the first xxx bytes and only dig deeper if duplicates occur. — joop, Nov 15 '18 at 09:41
`read a block from each in parallel…` I'd be impressed `…and then do the comparison in memory` *and then do the 5050 comparisons in memory* — greybeard, Nov 15 '18 at 12:46

tucuxi · Answer 1 · 2018-11-16T10:37:10.953

The fastest de-duplication algorithm will depend on several factors:

how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that is not part of the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
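To illustrate the reusable index from the third point, here is a minimal sketch. The index layout (a plain dictionary), the function names and the use of size plus modification time to decide whether a stored hash is stale are assumptions for this example, and SHA-256 stands in for "a strong hash".

```python
import hashlib
import os


def strong_hash(path, chunk_size=1 << 20):
    """SHA-256 of the whole file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def update_index(index, paths):
    """Refresh index in place as {path: (size, mtime_ns, digest)}.

    Only files that are new, or whose size or modification time changed,
    are read and hashed again. Returns the number of files rehashed.
    """
    rehashed = 0
    for p in paths:
        st = os.stat(p)
        entry = index.get(p)
        if entry is None or entry[:2] != (st.st_size, st.st_mtime_ns):
            index[p] = (st.st_size, st.st_mtime_ns, strong_hash(p))
            rehashed += 1
    return rehashed
```

On a second run over unchanged files this only costs one stat call per file. It errs on the side of caution: a file whose timestamp changed is rehashed even if its bytes did not change.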

Therefore, there is no single way to answer the original question. Fastest when?

Assuming that two files have the same size, there is, in general, no faster way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).
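For the two-file case, a block-by-block comparison with early exit might look like the sketch below. The function name and block size are illustrative assumptions. Python's standard library offers similar functionality as filecmp.cmp(a, b, shallow=False).

```python
import os


def same_content(path_a, path_b, block_size=1 << 16):
    """Return True if both files hold identical bytes; stop at the first difference."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False                      # different sizes can never match
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False              # early exit on the first differing block
            if not block_a:
                return True               # both files ended without a difference
```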

Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as long to hash as to compare byte-by-byte, hashing is better when n * k < n * (n-1) / 2, that is, when k < (n-1)/2. Hashes may yield false-positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. With k=3, you break even at n=7 and are ahead as soon as n>=8; with k=2, you reach break-even at n=5. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.
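The break-even arithmetic can be checked with a few lines. This is a sketch of the cost model from the paragraph above only: one full comparison costs 1 unit, one full hash costs k units, and all pair-wise comparisons run to the end (the worst case, with no early exit). The function names are made up for the example.

```python
def pairwise_cost(n):
    """Cost of comparing n files pair-wise: n * (n - 1) / 2 comparisons."""
    return n * (n - 1) // 2


def hashing_cost(n, k):
    """Cost of hashing n files once each, at k comparison-units per hash."""
    return n * k


def break_even_n(k):
    """Smallest n for which hashing costs no more than pair-wise comparison."""
    n = 2
    while hashing_cost(n, k) > pairwise_cost(n):
        n += 1
    return n
```

For example, break_even_n(3) returns 7 (21 units either way), break_even_n(2) returns 5 and break_even_n(1) returns 3; beyond those group sizes hashing is cheaper under this model.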

The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!

Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.
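A very fast first-level hash of this kind could be sketched as follows. The function name is hypothetical and CRC-32 is chosen only because it is cheap and available in the standard library; any fast non-cryptographic hash would do.

```python
import os
import zlib


def sampled_fingerprint(path, sample=1024):
    """Return (size, crc32) computed over the first, central and last `sample` bytes.

    Files no larger than three samples are checksummed completely.
    """
    size = os.path.getsize(path)
    crc = 0
    with open(path, "rb") as f:
        if size <= 3 * sample:
            crc = zlib.crc32(f.read())
        else:
            for offset in (0, (size - sample) // 2, size - sample):
                f.seek(offset)
                crc = zlib.crc32(f.read(sample), crc)
    return size, crc
```

Two files with different fingerprints are certainly different. Files with equal fingerprints are only candidates: any difference outside the three sampled regions goes unnoticed, so a strong hash or a byte-by-byte comparison is still needed as a second pass.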

[Update]

The OP seems to be under the impression that

Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.

I have added this segment to counter these arguments:

A strong hash (sha1) takes about 5 cycles per byte to compute, or around 1.5 ns per byte on a modern multi-GHz CPU. Disk latencies for a spinning hdd or an ssd are on the order of 5M ns and 75k ns, respectively. You can hash well over 1k of data in the time that it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at 1 byte per cycle. Main memory latencies are at around 120 ns - there's easily 400 cycles to be had in the time it takes to fulfill a single access-noncached-memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.
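The third point can be sketched like this: each candidate keeps a running hash object, and only files whose partial hashes still collide are read any further. The function name, the starting size, the doubling schedule and the 16 MiB cap are assumptions for illustration; storing the partial digests in a persistent index is left out.

```python
import hashlib
from collections import defaultdict


def incremental_duplicates(paths, first=4096, max_step=1 << 24):
    """Group same-size files by progressively longer prefix hashes."""
    state = {p: (open(p, "rb"), hashlib.sha1()) for p in paths}
    try:
        pending, done, step = [list(paths)], [], first
        while pending:
            next_pending = []
            for group in pending:
                buckets = defaultdict(list)
                for p in group:
                    f, h = state[p]
                    block = f.read(step)
                    h.update(block)       # extends the hash, nothing is re-read
                    at_eof = len(block) < step
                    buckets[(h.digest(), at_eof)].append(p)
                for (_, at_eof), members in buckets.items():
                    if len(members) > 1:  # files alone in a bucket are dropped
                        (done if at_eof else next_pending).append(members)
            pending, step = next_pending, min(step * 2, max_step)
        return done
    finally:
        for f, _ in state.values():
            f.close()
```

Only true duplicates (or files that differ very late) end up being hashed in full, and each byte is read once because the hash object is continued rather than restarted.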

My point is not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.

"Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other." That's wrong. I as a developer naturally would read the same block from every potential duplicate and only compare in memory. So hashing has no advantage, given that we are not talking about hash tables, caching and so on. — Marcel, Nov 15 '18 at 11:53
"Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew" That's only true when we can know for sure that the file (e.g. a database's table) wasn't modified. — Marcel, Nov 15 '18 at 11:56
Good answer. I'd add "How frequent is it to find many files of the same size?" Even if they're not near-duplicates, having large numbers of files all of the same size lends itself to the hashing approach. — patros, Nov 15 '18 at 18:05
@Marcel you can test for that easily by storing file metadata in the index. Most filesystems update a file's last-modified time every time it is modified. You would err on the side of caution: even if the modification does not actually change bytes, you would need to recalculate the hash, just-in-case. — tucuxi, Nov 16 '18 at 08:26
@Marcel re in-memory comparisons: reading files to memory is slow. Reading new blocks into memory is also slow, as compared to accessing the different caches in the processor. If you want to compare a lot of memory blocks against each other to find which are duplicates, sorting is O(n log n), hashing is O(n), and pair-wise comparisons are O(n^2) -- which one are you proposing? Asking as a developer. — tucuxi, Nov 16 '18 at 08:32
@tucuxi - Instead of using sorting or hashing you should likely use a trie [sic!] for that. The reason is that your numbers are not the full truth. If you want to compare a large set of strings of length k sorting can be between O(n*log(n)) and O(n*log(n)*k) depending on where the strings differ. Hashing is always O(n*k), with a larger constant and then you still need to compare the strings with the same hash. Additionally if strings don't differ you should, of course, remove them. — Hans Olsson, Nov 16 '18 at 08:53
@HansOlsson building the trie is not free, either in space or in time: O(n*k); yes, you only pay full price if there is a match or a close match; but storing a file in a trie requires a trie of the size of the file; hashes are always constant-sized. Regarding hashing always being expensive, you can do partial hashing first, say first few and last few bytes, and only re-check on collisions. That would avoid looking at the full length-k for most cases. — tucuxi, Nov 16 '18 at 10:13
@tucuxi If you have memory enough for the files then memory is not a major issue. And hashing is not the only algorithm that could look at only the first and last bytes; you could do a preliminary sort based on the same information as well - or partially build a trie. However, the main issue is that hashing doesn't fully solve the problem - you need both hashing and some way of checking hash-collisions; and if that second algorithm is good you can skip the hashing step. However, the main conclusion is that reading the entire file contents multiple times isn't desirable. — Hans Olsson, Nov 16 '18 at 10:22
@tucuxi I had a look into meowhash and learned that it's based on using AES hardware support. So in order to be as fast as they state, the CPU needs to have fast AES hardware support. Unfortunately my own Linux server's CPU is too old to have that... — Marcel, Nov 22 '18 at 09:49

Matt Timmermans · Answer 2 · 2018-11-15T16:06:46.310

The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.

This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match. That is often true for large files generally.

But...

When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).

Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.

When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.

Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.

So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.

Thanks for your answer - the first one that is not out of the "hash hash hash is all we need" class of answers. It seems to be fairly hard for humans to switch away from their beloved tool, even in situations where it doesn't fit. — Marcel, Nov 15 '18 at 20:24
How would you "read + compare all the files of the same size together" in an efficient manner, without any type of even weak hashes? Presumably you would either sort by contents or perform pairwise comparisons... — tucuxi, Nov 16 '18 at 08:34
For the seek time I believe some modern hard-drive interfaces allow queues of read-operations, so an additional possibility is to try to compare multiple files in parallel and let the disk handler minimize seek time. — Hans Olsson, Nov 16 '18 at 09:00
@tucuxi using an MSB-first radix sort, for example... but anyway this is a practical question so there's no reason to impose pedantic rules like "you must not use any kind of hashing whatsoever". — Matt Timmermans, Nov 16 '18 at 13:24
@MattTimmermans MSB-first radix sort *is* sorting by content. Also, the 1st byte can be seen as a (very) weak hash, and the whole algorithm as similar to the incremental hashing scheme that I propose in my answer. — tucuxi, Nov 16 '18 at 15:09

score 4 · Answer 3 · answered Nov 15 '18 at 18:57

Byte-by-byte comparison may be faster if all file groups of the same size fit in physical memory OR if you have a very fast SSD. It also may still be slower depending on the number and nature of the files, hashing functions used, cache locality and implementation details.

The hashing approach is a single, very simple algorithm that works on all cases (modulo the extremely rare collision case). It scales down gracefully to systems with small amounts of available physical memory. It may be slightly less than optimal in some specific cases, but should always be in the ballpark of optimal.

A few specifics to consider:

1) Did you measure and discover that the comparison within file groups was the expensive part of the operation? For a 2TB HDD walking the entire file system can take a long time on its own. How many hashing operations were actually performed? How big were the file groups, etc?

2) As noted elsewhere, fast hashing doesn't necessarily have to look at the whole file. Hashing some small portions of the file is going to work very well in the case where you have sets of larger files of the same size that aren't expected to be duplicates. It will actually slow things down in the case of a high percentage of duplicates, so it's a heuristic that should be toggled based on knowledge of the files.

3) Using a 128 bit hash is probably sufficient for determining identity. You could hash a million random objects a second for the rest of your life and have better odds of winning the lottery than seeing a collision. It's not perfect, but pragmatically you're far more likely to lose data in your lifetime to a disk failure than a hash collision in the tool.

4) For a HDD in particular (a magnetic disk), sequential access is much faster than random access. This means a sequential operation like hashing n files is going to be much faster than comparing those files block by block (which happens when they don't fit entirely into physical memory).
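The claim in point 3 can be put into numbers with the usual birthday estimate. This is a rough sketch that assumes an ideal, uniformly distributed hash and an 80-year lifetime; both the function name and those figures are assumptions for the example.

```python
def expected_colliding_pairs(n_items, hash_bits):
    """Birthday estimate: n*(n-1)/2 pairs, each colliding with probability 2**-bits.

    While the result is far below 1 it also approximates the probability
    of seeing at least one collision.
    """
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits


SECONDS_PER_YEAR = 365 * 24 * 3600
# Assumption: one million hashed objects per second for 80 years.
LIFETIME_HASHES = 10**6 * SECONDS_PER_YEAR * 80

if __name__ == "__main__":
    print(expected_colliding_pairs(LIFETIME_HASHES, 128))
```

Under these assumptions the estimate comes out just under 1e-8, that is, roughly a one-in-a-hundred-million chance of ever seeing a single collision with a 128-bit hash.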
