I'm looking for a simple way to detect whether files in a directory have changed between reboots, to avoid unnecessary synchronization. What would be the simplest way to do this using the Java 8 libraries? Should I XOR the MD5 digest of each file, or XOR the checksums of each file?

At the moment we don't need to descend into subdirectories.

Also we should not be using an OS event to detect this change as the method to detect will only be called at startup. The number of files in the directory can change between different versions of the application but these files will not generally change between reboots.

This looks like a relevant post: https://crypto.stackexchange.com/questions/1368/is-it-a-good-idea-to-use-bitwise-xor-on-a-set-of-md5-sums

simgineer
  • Would this help? https://docs.oracle.com/javase/tutorial/essential/io/notification.html – Lyju I Edwinson Nov 06 '19 at 23:47
  • @LyjuIEdwinson Thanks but I'm specifically not looking to detect file changes by OS events as this routine will only be triggered at system startup. – simgineer Nov 06 '19 at 23:59
  • What exactly is meant by "to avoid unnecessary synchronization"? Are you mirroring files to another directory? Then maybe you should just use rsync instead of rolling your own. – Axel Nov 07 '19 at 00:29
  • @Axel It's specific to our application, when these files are modified we need to resync our controller to our database and this process takes a while. Basically there is a large file that is being refactored into smaller xml files via xinclude statements and now I want to dynamically detect if these smaller files have been modified instead of maintaining a list of files in the code. We previously just cached a copy of the md5 of our large file. – simgineer Nov 07 '19 at 00:35
  • @simgineer You should read the file notification documentation carefully, as notifications are not only fired at system startup. – Mạnh Quyết Nguyễn Nov 07 '19 at 04:20

3 Answers

It depends what you mean by "simple".

On the one hand, you could make use of the file timestamps. But the problem is that timestamps can be misleading:

  • Checks that depend on timestamps can be affected by clock skew. (It depends on which clocks are involved and how they are managed.)

  • It is possible for file timestamps to be reset (e.g. by the "root" user), making it appear that a file has not changed.

  • It is trivial to change a file's "modified" timestamp without actually changing the file, e.g. with `touch`.

On the other hand, if you use checksums you have other problems:

  • Computing a file checksum entails reading the entire file. (A partial checksum is not sufficient to detect changes, in general.) Some checksum algorithms are relatively expensive as well.

  • You also need to know what the previous checksum for the file was. That means that you need a way / place to store it. That could be just another file, but then you need some infrastructure to update that file (reliably) as part of the synchronization procedure.

  • XORing multiple checksums has the problem that you then don't know which files have changed. If one file changes you need to synchronize all of them.

  • It is theoretically possible for a file to change and the MD5 checksum to be the same: probability 1 in 2^128. You can probably discount this ... unless yours is a security critical application. (Note that MD5 collision attacks are practical in some contexts; see https://en.wikipedia.org/wiki/Collision_attack)
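
As an illustration of the point about XOR hiding which file changed, here is a minimal Java 8 sketch that keeps one MD5 digest per file instead of a single combined value. The class and method names (`PerFileDigests`, `snapshot`, `md5Hex`, `changed`) are hypothetical, it only looks at regular files directly inside the directory, and how the previous map is stored between reboots is left open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: one MD5 digest per file rather than one XORed value.
final class PerFileDigests {

    /** Maps each regular file name directly inside dir to its MD5 digest (lowercase hex). */
    static Map<String, String> snapshot(Path dir) throws IOException, NoSuchAlgorithmException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    result.put(file.getFileName().toString(), md5Hex(file));
                }
            }
        }
        return result;
    }

    /** Streams the file through the digest so large files are not held in memory. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            while (in.read(buffer) != -1) {
                // Reading is enough: DigestInputStream feeds the bytes to md.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Names of files that were added, removed, or whose digest differs. */
    static Set<String> changed(Map<String, String> previous, Map<String, String> current) {
        Set<String> names = new TreeSet<>(previous.keySet());
        names.addAll(current.keySet());
        names.removeIf(name -> Objects.equals(previous.get(name), current.get(name)));
        return names;
    }
}
```

With per-file digests, an empty result from `changed` means nothing needs to be resynchronized, and a non-empty result names exactly the files involved.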


The other thing is that I suspect that you are trying to solve a solved problem. For example, the Linux / Unix rsync utility has options to use either timestamps or (MD5) checksums to decide which files need to be synchronized.

You don't need to implement everything yourself (in Java).

In response to your "we don't have access to the old file tree": there is an easy solution to that. Each time you reboot:

  1. copy the file tree
  2. compare the current files against the copy that you made last time you rebooted.
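
A minimal Java 8 sketch of those two steps, assuming a flat directory and files small enough to read into memory. The names `SnapshotCompare`, `differs` and `refresh` are hypothetical, and the comparison is a plain byte-for-byte check rather than a checksum.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: compares a directory with the copy taken at the last reboot.
final class SnapshotCompare {

    /** True if the regular files directly inside 'current' differ from those in 'previous'. */
    static boolean differs(Path current, Path previous) throws IOException {
        Map<String, Path> cur = listFiles(current);
        // No copy yet (first run) is treated as "everything changed".
        Map<String, Path> prev = Files.isDirectory(previous) ? listFiles(previous) : new TreeMap<>();
        if (!cur.keySet().equals(prev.keySet())) {
            return true; // a file was added or removed
        }
        for (Map.Entry<String, Path> e : cur.entrySet()) {
            byte[] now = Files.readAllBytes(e.getValue());
            byte[] before = Files.readAllBytes(prev.get(e.getKey()));
            if (!Arrays.equals(now, before)) {
                return true;
            }
        }
        return false;
    }

    /** Replaces the files in 'previous' with copies of the files in 'current'. */
    static void refresh(Path current, Path previous) throws IOException {
        Files.createDirectories(previous);
        for (Path old : listFiles(previous).values()) {
            Files.delete(old);
        }
        for (Map.Entry<String, Path> e : listFiles(current).entrySet()) {
            Files.copy(e.getValue(), previous.resolve(e.getKey()), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    private static Map<String, Path> listFiles(Path dir) throws IOException {
        Map<String, Path> files = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.put(p.getFileName().toString(), p);
                }
            }
        }
        return files;
    }
}
```

At startup one would call `differs` first, run the expensive update only if it returns true, and call `refresh` once the update has succeeded.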

Like I said in a comment, use your imagination.

Stephen C
  • Appreciate the insight. BTW, this isn't a file sync scenario. We parse config files (which are large), and if there is a change there is a DB update (which takes time). It is a legacy program, not something where it makes sense to use rsync. – simgineer Nov 07 '19 at 01:36
  • You can handle that with `rsync`. It just takes a bit of imagination. Alternatively, if this is not a (remote) file system sync problem you could simply compare files in an "old" and "new" file tree. There are Linux / UNIX utilities to do that too. – Stephen C Nov 07 '19 at 01:41
  • Yes, and if you go with @StephenC's suggestion, and you are using at least Java 12 (I know, it's not quite common for production code), you should check out [Files.mismatch(Path, Path)](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/nio/file/Files.html#mismatch(java.nio.file.Path,java.nio.file.Path)) which was introduced in Java 12. – Axel Nov 07 '19 at 08:49
  • Hi @StephenC and Axel, I believe what is being suggested with `rsync` requires the original directory structure to be available, and we don't have access to the old file system; all we store from the old file system is a cache or checksum of the files that matter. The synchronization routine runs after the old configuration files have been overwritten with the new configuration files. To clarify, the synchronization is not between two file systems but between a set of configuration files and a settings matrix engine backed by MySQL tables. The XML files define setting types and the DB stores data per profile. – simgineer Nov 08 '19 at 19:11

Is the file's modified time useful in your situation? An MD5 checksum is a more precise check in some situations.

Justin LI

Here's a routine I'm looking to use to generate a single hash from all files in a directory.

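import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Recursively compute a single MD5 signature of all files in a directory. Typically
 * used to determine whether a file in a directory or any of its subdirectories has been
 * modified since the last digest was taken.
 */
public class DirectoryDigest {

    private MessageDigest md;

    public DirectoryDigest() {
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            ApplicationManager.logStackTrace(e);
        }
    }

    public void update(Path dirPath) {
        update(dirPath, null);
    }

    /**
     * Adds the contents of every file under dirPath to the digest.
     * @param extension only files whose name ends with this suffix are included; null includes all files
     */
    public synchronized void update(Path dirPath, String extension) {
        List<Path> entries = new ArrayList<>();
        // try-with-resources closes the directory handle when iteration is done.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dirPath)) {
            stream.forEach(entries::add);
        } catch (IOException e) {
            ApplicationManager.logStackTrace(e);
            return;
        }
        // Directory iteration order is unspecified, and the digest depends on the
        // order of updates, so sort the entries to get a reproducible result.
        Collections.sort(entries);
        for (Path entry : entries) {
            if (Files.isDirectory(entry)) {
                // Only directories are recursed into.
                update(entry, extension);
            } else if (extension == null || entry.getFileName().toString().endsWith(extension)) {
                try {
                    md.update(Files.readAllBytes(entry));
                } catch (IOException e) {
                    ApplicationManager.logStackTrace(e);
                }
            } else {
                System.out.println("not processing: " + entry.getFileName());
            }
        }
    }

    /**
     * Returns the MD5 digest signature and resets the digest object.
     * @return the digest as a 32-character uppercase hexadecimal string
     */
    public String digest() {
        return String.format("%032X", new BigInteger(1, md.digest()));
    }
}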

It is used like this:

DirectoryDigest dd = new DirectoryDigest();
dd.update(csConfigDirPath, ".xml");
String currentPeripheralHash = dd.digest();
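
To decide at startup whether the resync is needed, the returned string still has to be compared with the value saved on the previous run. A minimal sketch of one way to do that, assuming the digest is kept in a small text file; the class `StoredDigest`, its methods, and the choice of a plain file are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: remembers the last directory digest in a text file.
final class StoredDigest {

    /** True if no digest has been stored yet or the stored one differs. */
    static boolean hasChanged(Path store, String currentDigest) throws IOException {
        if (!Files.exists(store)) {
            return true;
        }
        String previous = new String(Files.readAllBytes(store), StandardCharsets.UTF_8).trim();
        return !previous.equalsIgnoreCase(currentDigest);
    }

    /** Writes to a sibling temp file first so a crash does not leave a half-written digest. */
    static void save(Path store, String currentDigest) throws IOException {
        Path tmp = store.resolveSibling(store.getFileName() + ".tmp");
        Files.write(tmp, currentDigest.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, store, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling `save` only after the database update has completed means a failed update is retried on the next startup.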
simgineer