
In Java, here is one of several ways to process a "snapshot" of the files in a particular directory:

String directory = "/path/to/directory";
// Note: listFiles() returns null if the path is not a readable directory
List<File> fileList = Arrays.asList((new File(directory)).listFiles());
fileList.parallelStream().forEach(file -> {
    Path fileAsPath = file.toPath();
    // Assume the process method finishes by deleting the file or moving it to another directory
    process(fileAsPath);
});

And here is one of several ways to process files that are added to the directory:

WatchService watchService = FileSystems.getDefault().newWatchService();
Path directoryAsPath = Paths.get(directory);
WatchKey watchKey = directoryAsPath.register(watchService, ENTRY_CREATE);

while (true) {
    // take() blocks until a key is available (and throws InterruptedException)
    WatchKey key = watchService.take();

    for (WatchEvent<?> event: key.pollEvents()) {
        WatchEvent.Kind<?> kind = event.kind();
        if (kind == OVERFLOW) {
            continue;
        }

        // The context is a Path relative to the watched directory
        Path filename = (Path) event.context();
        // Again, assume the process method finishes by deleting the file or moving it
        // to another directory
        process(directoryAsPath.resolve(filename));
    }

    // Reset the key so the directory continues to be watched;
    // if it is no longer valid, the directory is inaccessible
    if (!key.reset()) {
        break;
    }
}

What would be a fairly straightforward approach to process pre-existing files in the directory -- such as when the process starts -- and also process files that are subsequently added?

Each file should be processed exactly once. In this situation, the order in which files are processed does not matter.

I suppose one straightforward way would be to put the first block of logic in an infinite loop -- just have the listFiles() method take a new snapshot of the directory, perhaps with a brief delay between iterations -- but this seems clunky. Files can be on the order of tens of megabytes, so it would be nice not to have to wait for an entire "snapshot" of files to be fully processed before beginning another "snapshot" of files.
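For clarity, here is a minimal sketch of one iteration of that polling idea. The process method here is a stand-in that simply deletes the file (matching the "process then delete or move" assumption above), and pollOnce is just my name for one snapshot pass:

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectoryPoller {

    // Stand-in for the real processing step: here it just deletes the file,
    // matching the "process finishes by deleting or moving the file" assumption.
    static void process(Path file) throws Exception {
        Files.deleteIfExists(file);
    }

    // One iteration of the "snapshot" loop: list the directory and
    // process whatever regular files are there right now.
    static void pollOnce(Path directory) throws Exception {
        File[] snapshot = directory.toFile().listFiles();
        if (snapshot == null) {
            return; // directory missing or not readable
        }
        for (File file : snapshot) {
            if (file.isFile()) {
                process(file.toPath());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("poll-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.txt"));
        pollOnce(dir);
        try (var remaining = Files.list(dir)) {
            System.out.println(remaining.count()); // prints 0
        }
    }
}
```

The clunkiness is exactly what the sketch shows: each pass is all-or-nothing over the snapshot, and the delay between iterations is arbitrary.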

Using a database to track the files that have been processed seems overly complicated.

Thanks!

  • Which OS are you on? – Bohemian Oct 05 '20 at 19:25
  • 1
    I don’t understand the problem. When you say “the process method finishes by deleting the file or moving it to another directory”, there is no possibility of processing a file twice. – Holger Oct 06 '20 at 09:50
  • @Holger Each file can be multiple megabytes in size. Processing it can take time -- during which a second attempt might be made to process it. Hmm... maybe one workaround would be to, as a pre-processing action, move the file to a unique, temporary directory within which it will be processed. – Dynotherm Connector Oct 06 '20 at 14:52
  • 1
    You’ve shown a single loop. Processing can not overlap when you process one file after another. – Holger Oct 06 '20 at 14:54
  • There are two loops: (1) the loop that processes pre-existing files, and (2) the loop that processes newly added files. I want to ensure that each file is processed exactly once. I assume each would need to run in a separate thread -- otherwise, some files might get missed by both loops. With separate threads, there is a possibility that both threads might try to process the same file. – Dynotherm Connector Oct 06 '20 at 16:35

1 Answer


Use 2 directories.

First move existing files out to a temp dir, then copy them back. These files, and ones created, will all trigger the watch as new files.

If you’re on Linux, you could instead try touching each existing file (untested, but that may be enough to trigger the watch).
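A rough sketch of the move-out-and-move-back step. The method name primeWatcher is mine, and this assumes the WatchService from the question is already registered on the directory before the files are moved back in, so each return move fires an ENTRY_CREATE event:

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WatchPrimer {

    // Move existing files to a staging dir, then move them back so that
    // each one fires an ENTRY_CREATE event on the already-registered watcher.
    static void primeWatcher(Path watchedDir) throws Exception {
        Path staging = Files.createTempDirectory("staging");
        try (DirectoryStream<Path> existing = Files.newDirectoryStream(watchedDir)) {
            for (Path file : existing) {
                Files.move(file, staging.resolve(file.getFileName()));
            }
        }
        try (DirectoryStream<Path> staged = Files.newDirectoryStream(staging)) {
            for (Path file : staged) {
                // Moving back into the watched dir looks like a new file
                Files.move(file, watchedDir.resolve(file.getFileName()));
            }
        }
        Files.delete(staging); // staging dir is empty again
    }
}
```

Note the ordering matters: the watcher must be registered before primeWatcher runs, or files created during the round trip could slip through (the concern raised in the comment below).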

Bohemian
  • Depending on how long it takes to move the existing files, isn't it possible that some files could get missed -- files added after the original "snapshot" of files is moved, but before the watch is registered? – Dynotherm Connector Oct 06 '20 at 03:20