I would approach this issue in the following way:
- Use Java 8 parallel streams or something similar so that my multi-core CPU is utilized to speed things up.
- Fill a Map with all files.
- Get all file sizes.
- Eliminate all files which have a unique file size. If your files are "quite normal", this will quite likely already eliminate several of the larger files. This can be done by using the file size as the key in a Map and a List of paths as the value, then removing all entries whose list has size 1.
- Get the checksum of the next 4K bytes of each remaining file (on the first pass, its first 4K bytes).
- Eliminate all files which have a unique checksum.
- Repeat the previous two steps (checksum the next 4K chunk, eliminate unique checksums) until the files with non-unique checksums have been read completely.
- The remaining files are duplicates (provided the checksum has no collisions).
- In case of collisions, compare the file contents of the remaining candidates byte by byte (a sketch of this step follows below the list).
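The byte-wise comparison from that last step is not part of the program further down, so here is a minimal sketch of how it could look. The class name CompareContents and the method contentsEqual are my own placeholders, not anything the program defines:

import java.io.*;
import java.nio.file.*;

public class CompareContents {

    /** Returns true if both files contain exactly the same bytes; stops at the first difference. */
    public static boolean contentsEqual(final Path a, final Path b) throws IOException {
        if (a.toFile().length() != b.toFile().length()) {
            return false;
        }
        try (final InputStream inA = new BufferedInputStream(new FileInputStream(a.toFile()));
                final InputStream inB = new BufferedInputStream(new FileInputStream(b.toFile()))) {
            int byteA;
            int byteB;
            do {
                byteA = inA.read();
                byteB = inB.read();
                if (byteA != byteB) {
                    return false;
                }
            } while (byteA != -1);
            return true;
        }
    }
}

Within each group of candidates this would have to be run pairwise (or against one reference file per group), but given how few files normally survive the checksum step, that cost should be negligible.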
The key to speeding up the checksumming is to do it in chunks and compare in between: if the first bytes already differ, why look at the remaining ones?
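To make the chunked idea a bit more concrete, here is a rough sketch of what a round-based, incremental checksum could look like. The class name ChunkedChecksum, the 4K chunk size and the method names are assumptions of mine, not something the program below uses:

import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

/** Keeps one open stream and a running CRC-32 per candidate file. */
public class ChunkedChecksum implements AutoCloseable {

    private static final int CHUNK_SIZE = 4096;

    private final Path path;
    private final CheckedInputStream in;
    private boolean exhausted;

    public ChunkedChecksum(final Path path) throws IOException {
        this.path = path;
        this.in = new CheckedInputStream(
                new BufferedInputStream(new FileInputStream(path.toFile())), new CRC32());
    }

    /** Advances the checksum by one chunk (or less at end of file) and returns its current value. */
    public long nextChecksum() throws IOException {
        final byte[] buf = new byte[CHUNK_SIZE];
        int remaining = CHUNK_SIZE;
        int bytesRead;
        while (remaining > 0 && (bytesRead = in.read(buf, 0, remaining)) != -1) {
            remaining -= bytesRead;
        }
        if (remaining > 0) {
            exhausted = true; // hit end of file before filling the chunk
        }
        return in.getChecksum().getValue();
    }

    public boolean isExhausted() {
        return exhausted;
    }

    public Path getPath() {
        return path;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

Per round, every surviving file advances its checksum by one chunk; files whose running checksum is unique within their group are closed and dropped, and the loop stops once all remaining files are exhausted.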
Another key to speeding up the comparison might be to invert the Map: I'd use a Map&lt;checksum, List&lt;path&gt;&gt; instead of a Map&lt;path, checksum&gt;.
This way you can quite directly eliminate all those entries which have a List of size 1 without any further lookup or comparison.
There might be even smarter ways than that; this is just what came to mind when I read the task.
I have sketched a small program that almost performs this. It does not compute the checksum in fragments, because when I ran it on 236,670 files totaling 6 GB of data, it took only 7 seconds. Disclaimer: I have an SSD. But maybe I will still update the program to use partial checksums.
import java.io.*;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.stream.*;
import java.util.zip.*;

public class FindDuplicates {

    public static void main(final String... args) throws IOException {
        findDuplicates(argsOrCurrentDirectory(args));
    }

    private static String[] argsOrCurrentDirectory(final String... args) {
        return args.length == 0 ? new String[] {"."} : args;
    }

    /** Groups files by size, then by checksum, and prints the remaining duplicate candidates. */
    private static void findDuplicates(final String... paths) throws IOException {
        final Stream<Path> allFilesInPaths = find(paths);
        final Map<Long, List<Path>> filesBySize =
                allFilesInPaths.collect(Collectors.groupingByConcurrent(path -> path.toFile().length()));
        final Stream<Path> filesWithNonUniqueSizes = getValueStreamFromDuplicates(filesBySize);
        final Map<Long, List<Path>> filesByChecksum =
                filesWithNonUniqueSizes.collect(Collectors.groupingBy(FindDuplicates::getChecksum));
        final Stream<Path> filesWithNonUniqueChecksums = getValueStreamFromDuplicates(filesByChecksum);
        filesWithNonUniqueChecksums.forEach(System.out::println);
    }

    private static Stream<Path> toPaths(final String... pathnames) {
        return Arrays.asList(pathnames).parallelStream().map(FileSystems.getDefault()::getPath);
    }

    private static Stream<Path> find(final String... pathnames) {
        return find(toPaths(pathnames));
    }

    private static Stream<Path> find(final Stream<Path> paths) {
        return paths.flatMap(FindDuplicates::findSinglePath);
    }

    /** Recursively finds all regular files below the given path. */
    private static Stream<Path> findSinglePath(final Path path) {
        try {
            return Files.find(path, 127, ($, attrs) -> attrs.isRegularFile());
        } catch (final IOException e) {
            System.err.format("%s: error: Unable to traverse path: %s%n", path, e.getMessage());
            return Stream.empty();
        }
    }

    /** Returns the values of all entries whose list holds more than one element, i.e. the non-unique ones. */
    public static <V> Stream<V> getValueStreamFromDuplicates(final Map<?, List<V>> original) {
        return original.values().parallelStream().filter(list -> list.size() > 1).flatMap(Collection::parallelStream);
    }

    /** Calculates the CRC-32 checksum of the complete file contents. */
    public static long getChecksum(final Path path) {
        try (final CheckedInputStream in =
                new CheckedInputStream(new BufferedInputStream(new FileInputStream(path.toFile())), new CRC32())) {
            return tryGetChecksum(in);
        } catch (final IOException e) {
            System.err.format("%s: error: Unable to calculate checksum: %s%n", path, e.getMessage());
            return 0L;
        }
    }

    /** Reads the stream to its end; the CheckedInputStream updates the checksum as a side effect. */
    public static long tryGetChecksum(final CheckedInputStream in) throws IOException {
        final byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // nothing to do, reading is enough
        }
        return in.getChecksum().getValue();
    }
}
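For completeness, the "partial checksums" mentioned above would mainly need a bounded read instead of reading to the end of the stream; something along these lines (tryGetPartialChecksum is just a placeholder name, not part of the program):

    /** Checksums at most the first 4096 bytes instead of the whole file. */
    public static long tryGetPartialChecksum(final CheckedInputStream in) throws IOException {
        final byte[] buf = new byte[4096];
        int remaining = buf.length;
        int bytesRead;
        while (remaining > 0 && (bytesRead = in.read(buf, 0, remaining)) != -1) {
            remaining -= bytesRead;
        }
        return in.getChecksum().getValue();
    }

Files that survive this first filter could then be re-read for a full checksum or a direct content comparison.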