
I have the following situation: a directory tree with big files (about 5000 files, ~4 GB in size). I need to find duplicates in this tree.

I tried to use the CRC32 and Adler32 classes built into Java, but it is VERY slow (about 3-4 minutes per file).

The code was like this:

1) Init a map <path, checksum>
2) Create a Checksum instance (CRC32 or Adler32)
3) Read the file block by block (10-100 bytes)
4) In each iteration call update()
5) Put the resulting checksum in the map <path, checksum>
6) Find duplicates
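
In code, steps 2-5 looked roughly like this (a simplified sketch; the real class and variable names differ):

import java.io.*;
import java.util.zip.*;

public class ChecksumSketch {
    // Steps 2-5: checksum one file block by block.
    static long checksumOf(final File file, final int blockSize) throws IOException {
        final Checksum checksum = new CRC32();            // step 2 (or new Adler32())
        final byte[] block = new byte[blockSize];         // step 3: block size (10-100 bytes)
        try (final InputStream in = new FileInputStream(file)) {
            int bytesRead;
            while ((bytesRead = in.read(block)) != -1) {
                checksum.update(block, 0, bytesRead);     // step 4
            }
        }
        return checksum.getValue();                       // step 5: store this in the map
    }
}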

Question: is there any way to speed up gathering the checksums in steps 3-4?

Ivan
  • The CRC32 or Adler32 checksums are not _that_ slow; there is another, unknown reason why it's running as slowly as you observe. That said, if you're implying you only read 10-100 bytes at a time from a 4 GB file, you need to up that by a few orders of magnitude; 4096 bytes is a good start. Also, measure where you're spending the time, so you're certain most of it is not spent in a bad implementation of your duplicate detection in your step 6). – nos Mar 05 '15 at 12:44
  • How similar are the files? If they mostly differ, I would eliminate most of them just by the first few hundred bytes, then work on the remaining. – Kayaman Mar 05 '15 at 12:46
  • @nos A simple performance test shows that steps 3-4 take most of the time. Also, that's a good idea to increase the block size, thanks :) – Ivan Mar 05 '15 at 12:51
  • @Ivan Then you can disable your step 4 to see how much things improve. Be sure that the file system cache does not trick you in your tests. If the files need to be read from the hard drive, it'll be slow. Hard drives are slow. If you've previously read the same file, it resides in RAM, so reading it doesn't access the hard drive and things go fast - just make sure you're always testing the same thing. – nos Mar 05 '15 at 14:04
  • I started a Java-based duplicate finder project at https://github.com/koppor/kodf/ (after investigating other duplicate finders; the one at https://github.com/carlbeech/fast-duplicate-finder was most promising). The idea is to first compare the file size and then calculate checksums only if necessary. – koppor May 17 '20 at 10:59

1 Answer


I would approach this issue in the following way:

  1. Try to use Java 8 parallel streams or something similar so that my multi-core CPU will be utilized for speeding things up.
  2. Fill a Map with all files.
  3. Get all file sizes.
  4. Eliminate all files which have a unique file size. If your files are "quite normal", this would quite likely already eliminate several of the larger files. This can actually be done by using the file size as the key in a Map and a List as value. Then remove all entries which have a list of size 1.
  5. Get the checksum of the next (first time: first) 4K bytes of each file.
  6. Eliminate all files which have a unique checksum.
  7. Repeat 5. and 6. until the files with non-unique checksums are all read completely.
  8. The remaining files are duplicates (if the checksum has no collisions).
  9. Compare the file contents of the duplicate files in case of collisions.

The key to speeding up the checksum is to do it in chunks and compare in between. If the first bytes already differ, why look at the remaining ones?

Another key to speeding up the comparison might be to invert the Map. I'd use a Map<checksum, List<path>> instead of Map<path, checksum>. This way you can quite directly eliminate all those entries which have a List of size 1 without any further lookup or comparison.
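
A minimal illustration of that idea (names here are only for the sketch; the full program below does the same thing with streams):

import java.nio.file.Path;
import java.util.*;

class DuplicateCandidates {
    // Keep only checksums that more than one file maps to.
    static void keepOnlyDuplicates(final Map<Long, List<Path>> filesByChecksum) {
        // removeIf on the values() view also removes the corresponding map entries.
        // (Java 8's removeIf; on Java 7, iterate entrySet() with an Iterator and remove().)
        filesByChecksum.values().removeIf(paths -> paths.size() < 2);
    }
}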

There might be even smarter ways than that; I just dumped what came to mind when I read about the task.

I have sketched a small program that almost performs this. It does not get the checksum in fragments. The reason is that when I ran it on 236670 files with a total of 6 GB of data, it took only 7 seconds. Disclaimer: I have an SSD. But maybe I will even update the program to do partial checksums.

import java.io.*;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.stream.*;
import java.util.zip.*;

public class FindDuplicates {
    public static void main(final String... args) throws IOException {
        findDuplicates(argsOrCurrentDirectory(args));
    }

    private static String[] argsOrCurrentDirectory(final String... args) {
        return args.length == 0 ? new String[] {"."} : args;
    }

    private static void findDuplicates(final String... paths) throws IOException {
        final Stream<Path> allFilesInPaths = find(paths);

        // Group by file size first, so only files with a non-unique size need checksums.
        final Map<Long, List<Path>> filesBySize = allFilesInPaths.collect(Collectors.groupingByConcurrent(path -> path.toFile().length()));
        final Stream<Path> filesWithNonUniqueSizes = getValueStreamFromDuplicates(filesBySize);

        // Then group the remaining files by full-file CRC32; non-unique checksums indicate likely duplicates.
        final Map<Long, List<Path>> filesByChecksum = filesWithNonUniqueSizes.collect(Collectors.groupingBy(FindDuplicates::getChecksum));
        final Stream<Path> filesWithNonUniqueChecksums = getValueStreamFromDuplicates(filesByChecksum);

        filesWithNonUniqueChecksums.forEach(System.out::println);
    }

    private static Stream<Path> toPaths(final String... pathnames) {
        return Arrays.asList(pathnames).parallelStream().map(FileSystems.getDefault()::getPath);
    }

    private static Stream<Path> find(final String... pathnames) {
        return find(toPaths(pathnames));
    }

    private static Stream<Path> find(final Stream<Path> paths) {
        return paths.flatMap(FindDuplicates::findSinglePath);
    }

    private static Stream<Path> findSinglePath(final Path path) {
        try {
            return Files.find(path, 127, ($, attrs) -> attrs.isRegularFile());
        } catch (final IOException e) {
            System.err.format("%s: error: Unable to traverse path: %s%n", path, e.getMessage());
            return Stream.empty();
        }
    }

    public static <V> Stream<V> getValueStreamFromDuplicates(final Map<?, List<V>> original) {
        return original.values().parallelStream().filter(list -> list.size() > 1).flatMap(Collection::parallelStream);
    }

    public static long getChecksum(final Path path) {
        try (final CheckedInputStream in = new CheckedInputStream(new BufferedInputStream(new FileInputStream(path.toFile())), new CRC32())) {
            return tryGetChecksum(in);
        } catch (final IOException e) {
            System.err.format("%s: error: Unable to calculate checksum: %s%n", path, e.getMessage());
            return 0L;
        }
    }

    public static long tryGetChecksum(final CheckedInputStream in) throws IOException {
        final byte[] buf = new byte[4096];
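        // Drain the whole stream; the CheckedInputStream updates the CRC32 as a side effect of read().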
        for (int bytesRead; (bytesRead = in.read(buf)) != -1; );
        return in.getChecksum().getValue();
    }
}
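
A rough idea of what such a partial-checksum update could look like (PartialChecksum and checksumOfRange are hypothetical names, not part of the program above): per round, checksum only the next chunk of each remaining file, reopening the file and skipping to the current offset rather than keeping thousands of files open. Each round would then regroup the remaining files by the checksum of that chunk, as in the list above.

import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

class PartialChecksum {
    // Hypothetical sketch: CRC32 of the byte range [offset, offset + length) of a file.
    static long checksumOfRange(final Path path, final long offset, final long length) throws IOException {
        final CRC32 crc = new CRC32();
        final byte[] buf = new byte[8192];
        try (final InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
            // Skip to the current offset instead of keeping the file open between rounds.
            long toSkip = offset;
            while (toSkip > 0) {
                final long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    return crc.getValue();   // file shorter than expected, nothing left to checksum
                }
                toSkip -= skipped;
            }
            long remaining = length;
            int n;
            while (remaining > 0 && (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) != -1) {
                crc.update(buf, 0, n);
                remaining -= n;
            }
        }
        return crc.getValue();
    }
}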
Christian Hujer
  • This stuff made me curious. My solution takes 3 seconds to identify all files with duplicate sizes in `/usr/`, which is 6.0G (`du -csh /usr/`) and 236670 files (`find /usr/ -type f | wc -l`). This is fun! – Christian Hujer Mar 05 '15 at 14:22
  • By the way, you will not want to keep the files open when you do partial checksums. You will want to reopen them and use `seek()`. All operating systems known to me limit the number of files that could be simultaneously open. – Christian Hujer Mar 05 '15 at 14:24
  • Wow, I forgot about CheckedInputStream. There are two things. First, Java 8: we use Java 7 in our project and currently can't upgrade. Second, you have an SSD :) So, I will adapt your code to Java 7 with some changes in logic. Thanks very much! – Ivan Mar 05 '15 at 16:32
  • You're still using rotational storage?! Upgrade! The age of rotational storage is over! ;) – Christian Hujer Mar 05 '15 at 19:37
  • Actually no. It will take about 4-5 years to fully integrate solid state drives into our life. Like Win XP: we used it from 2001 to 2014, and in the second half of 2014 we updated to Win 7, although Win 10 is coming in April. Inertia :) – Ivan Mar 06 '15 at 11:50