
The problem.

A daemon implemented in Java and running on Windows 7 copies files from one directory into another. Both the source and the target directory are on a network share hosted by Windows Server 2016. Copying is done using Apache Commons IO, and occasionally it fails with the following stack trace and a (German) message reading roughly "no more files":

java.io.IOException: Es sind keine weiteren Dateien vorhanden
        at java.io.WinNTFileSystem.canonicalize0(Native Method)
        at java.io.WinNTFileSystem.canonicalize(Unknown Source)
        at java.io.File.getCanonicalPath(Unknown Source)
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:642)
        at org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:587)
        at org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:558)
        at de.am_soft.osgi.dokliste.eingaenge.impl.internal.Eingang.copyFilesToDbxmlFolders(Eingang.java:283)

Apache Commons IO has the following code at line 642. That line really is only the if condition, not the throw inside it:

if (srcFile.getCanonicalPath().equals(destFile.getCanonicalPath())) {
    throw new IOException("Source '" + srcFile + "' and destination '" + destFile + "' are the same");
}

So the problem is not with copying itself, but already with generating the canonical paths. Running Process Monitor on the client where the daemon runs confirms that. The following is the last event before the daemon logs the above exception and tries to send error mails using Logback. The result of that event (NO MORE FILES) matches the error message of the stack trace:

10:12:06,6244515        integration.exe 6928    QueryDirectory  \\HOST\SHARE$\DocBeam3\[...].zip  NO MORE FILES   Filter: 20191106-081920-[...].zip

Additionally, the earlier lines in ProcMon show that the exception happens for destFile only. Running the daemon on my local machine instead always leads to the following logged event (NO SUCH FILE):

19:08:03,7485947    java.exe    6232    QueryDirectory  C:\Users\[...].zip  NO SUCH FILE    Filter: 20191022-143101-[...].zip

I debugged the native methods and came across lastErrorReportable, which explicitly checks for a handful of special error codes. That list doesn't contain ERROR_NO_MORE_FILES from the first event, while it does contain ERROR_FILE_NOT_FOUND from the second one:

    if ((errval == ERROR_FILE_NOT_FOUND)
        || (errval == ERROR_DIRECTORY)
        || (errval == ERROR_PATH_NOT_FOUND)
        || (errval == ERROR_BAD_NETPATH)
        || (errval == ERROR_BAD_NET_NAME)
        || (errval == ERROR_ACCESS_DENIED)
        || (errval == ERROR_NETWORK_UNREACHABLE)
        || (errval == ERROR_NETWORK_ACCESS_DENIED)) {
        return 0;
    }

https://github.com/openjdk/jdk/blob/master/src/java.base/windows/native/libjava/canonicalize_md.c#L131
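
To make the gap concrete, here is a minimal Java model of that check. It is only a sketch for illustration, not JDK code: the class and method names are made up, and the numeric values are the Win32 codes from winerror.h. Like the native function, it treats the eight listed codes as ignorable and everything else, including ERROR_NO_MORE_FILES, as reportable.

```java
import java.util.Set;

/** Hypothetical Java model of lastErrorReportable() from canonicalize_md.c. */
final class LastErrorModel {
    // Win32 error codes as defined in winerror.h.
    static final int ERROR_FILE_NOT_FOUND = 2;
    static final int ERROR_PATH_NOT_FOUND = 3;
    static final int ERROR_ACCESS_DENIED = 5;
    static final int ERROR_NO_MORE_FILES = 18;
    static final int ERROR_BAD_NETPATH = 53;
    static final int ERROR_NETWORK_ACCESS_DENIED = 65;
    static final int ERROR_BAD_NET_NAME = 67;
    static final int ERROR_DIRECTORY = 267;
    static final int ERROR_NETWORK_UNREACHABLE = 1231;

    // The codes for which the native check returns 0 (not reportable).
    private static final Set<Integer> IGNORED = Set.of(
            ERROR_FILE_NOT_FOUND,
            ERROR_DIRECTORY,
            ERROR_PATH_NOT_FOUND,
            ERROR_BAD_NETPATH,
            ERROR_BAD_NET_NAME,
            ERROR_ACCESS_DENIED,
            ERROR_NETWORK_UNREACHABLE,
            ERROR_NETWORK_ACCESS_DENIED);

    /** Returns false for the ignored codes and true for every other code. */
    static boolean isReportable(int errval) {
        return !IGNORED.contains(errval);
    }

    public static void main(String[] args) {
        System.out.println("ERROR_FILE_NOT_FOUND reportable: "
                + isReportable(ERROR_FILE_NOT_FOUND));
        System.out.println("ERROR_NO_MORE_FILES reportable: "
                + isReportable(ERROR_NO_MORE_FILES));
    }
}
```

In this model the local case (ERROR_FILE_NOT_FOUND) is ignored, while the case seen on the network share (ERROR_NO_MORE_FILES) is reported.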

So it seems that whenever ERROR_NO_MORE_FILES occurs, canonicalizing a path is simply aborted with an error, instead of the error being ignored as it is for the other codes:

if (!lastErrorReportable()) {
    if (!(dst = wcp(dst, dend, L'\0', src, src + wcslen(src)))) {
        goto err;
    }
    break;
} else {
    goto err;
}

https://github.com/openjdk/jdk/blob/master/src/java.base/windows/native/libjava/canonicalize_md.c#L246

The exception thrown there matches what I get. The given message is only a fallback, which is not used in my case:

if (rv == NULL && !(*env)->ExceptionCheck(env)) {
    JNU_ThrowIOExceptionWithLastError(env, "Bad pathname");
}

https://github.com/openjdk/jdk/blob/master/src/java.base/windows/native/libjava/WinNTFileSystem_md.c#L258
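
Because the stack trace ends in File.getCanonicalPath() and not in the copy itself, one thing that could be experimented with is copying through java.nio.file.Files.copy, which works on Path objects and does not go through File.getCanonicalPath(). The following is a minimal sketch of that idea, not something tested against the affected share. The class name NioCopySketch and its method are made up for illustration, and whether the NIO code path runs into ERROR_NO_MORE_FILES in the same situation is an open assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for FileUtils.copyFileToDirectory based on NIO. */
final class NioCopySketch {

    /** Copies src into targetDir under the same file name and returns the new path. */
    static Path copyFileToDirectory(Path src, Path targetDir) throws IOException {
        // Create the target directory if it is missing.
        Files.createDirectories(targetDir);
        Path dest = targetDir.resolve(src.getFileName());
        // Overwrite an existing file of the same name.
        return Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real share paths.
        Path src = Files.createTempFile("eingang", ".zip");
        Files.write(src, new byte[] {1, 2, 3});
        Path dir = Files.createTempDirectory("dbxml");
        System.out.println("copied to " + copyFileToDirectory(src, dir));
    }
}
```

Note that this sketch drops the "source and destination are the same" check that Commons IO performs, so the caller has to make sure both paths differ.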

Additional observations.

The interesting thing is that the daemon doesn't fail on each and every file copy, but only rarely. When it does fail, it seems to be related to other directories and files already being present in the target directory. Those are completely unrelated to the daemon and, according to ProcMon, they don't get iterated, yet their mere existence seems to make a difference. If I simply delete all of those files and directories so that the target directory is empty, copying instantly succeeds again. That's interesting because having files and directories in the target directory in my local setup doesn't seem to have any influence: copying never fails, and the event logged by ProcMon is never ERROR_NO_MORE_FILES. After emptying the directory on the setup where the problem happens, ProcMon logs ERROR_FILE_NOT_FOUND again as well.

The question.

So it seems that under some currently unknown circumstances, Windows decides to use ERROR_NO_MORE_FILES as the last error in the calls to FindFirstFileW made by wcanonicalize. Because Java doesn't have that code on its list of ignored errors, copying fails in those circumstances, even though it seems to be a perfectly valid situation. I don't see any real error otherwise.

So should ERROR_NO_MORE_FILES be added to lastErrorReportable? And if so, whom do I need to ask? :-)

Thorsten Schöning
  • https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-November/063437.html – Thorsten Schöning Nov 15 '19 at 08:00
  • Curious if you did find out anything more since then and how you've tackled the issue. At work we're now facing this issue after the upgrade of a file cluster from WS 2K8R2 to WS 2019. – dSebastien Jul 07 '20 at 09:35
  • @dSebastien No, I implemented workarounds somewhat assuring that target directories I work with are empty. That made the problem almost(?) go away. I came across similar problems in a native Win32-app implemented in C++ when using `FindFirstFileW` as well. Under rare circumstances, calling that resulted in `ERROR_NO_MORE_FILES` instead of `ERROR_FILE_NOT_FOUND`, which didn't happen for years in the past with older versions of Windows Server. Seems to me something has changed in Windows and one needs to deal with that additional error in some APIs now. So I hope Java gets compatible in future. – Thorsten Schöning Jul 07 '20 at 09:56
  • @dSebastien It's even possible that Windows is not the problem at all, but something low-level like a virus scanner interfering with requests to the file system or such. https://stackoverflow.com/questions/58825963/when-does-findfirstfilew-set-last-error-to-be-error-no-more-files-instead-of-err?noredirect=1&lq=1#comment103948605_58825963 – Thorsten Schöning Jul 07 '20 at 10:01
  • @dSebastien and Thorsten, did you try Polux2's answer below, about setting DirectoryCacheLifetime=0? We are experiencing the same behaviour. – AndrWeisR Apr 29 '21 at 23:58
  • @AndrWeisR I'm not in control of the server, it's a production system of a customer, so I wasn't able to check this myself. This is the reason why I didn't accept the answer yet, but only upvoted, because it makes sense. If you can test it yourself and things work, tell us and I will accept the answer. – Thorsten Schöning Apr 30 '21 at 07:33
  • 1
    @ThorstenSchöning We tried the registry setting on the client server, and so far it appears the "No More File" errors have stopped. – AndrWeisR May 03 '21 at 05:49

1 Answer

This behavior is caused by an SMB incompatibility between a Windows Server 2019 file server and clients running earlier versions of Windows. The two sides handle the cache of directory metadata differently, which causes this issue when reading a share with many files and folders.

Microsoft has unfortunately not yet released a fix for this bug.

A workaround is to disable the SMB metadata caching on the client side with this registry setting: HKLM\System\CurrentControlSet\Services\LanmanWorkstation\Parameters\DirectoryCacheLifetime=0 (DWORD)
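
As a sketch of how that value could be set from a script, the following Java snippet builds a matching reg add command line. The key, value name, type, and data come from the setting above. The class and method names are made up, and the snippet deliberately only prints the command instead of running it, since changing HKLM needs an elevated prompt on the client.

```java
import java.util.List;

/** Hypothetical helper that builds the reg add command for the workaround. */
final class DirectoryCacheWorkaround {
    static final String KEY =
            "HKLM\\System\\CurrentControlSet\\Services\\LanmanWorkstation\\Parameters";
    static final String VALUE_NAME = "DirectoryCacheLifetime";

    /** Returns the command that sets DirectoryCacheLifetime to 0 as a DWORD. */
    static List<String> regAddCommand() {
        return List.of("reg", "add", KEY,
                "/v", VALUE_NAME,   // value name
                "/t", "REG_DWORD",  // value type
                "/d", "0",          // value data: disables the directory cache
                "/f");              // overwrite without prompting
    }

    public static void main(String[] args) {
        // Print only; run the printed command manually from an elevated prompt.
        System.out.println(String.join(" ", regAddCommand()));
    }
}
```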

Polux2
  • That was tested to work at least once, so I'll accept it: https://stackoverflow.com/questions/58825588/does-java-need-to-support-error-no-more-files-when-canonicalizing-paths-on-windo?noredirect=1#comment119068554_58825588 – Thorsten Schöning May 03 '21 at 13:00
  • YES!! this answer saved my life. If only I had found it last week. – MichaelRom May 04 '21 at 16:38
  • 1
    is this issue also present in Windows Server 2022 ? – Somebody Mar 08 '23 at 15:12
  • Can someone link to the known issue? It may be that it is not the only combo of OSs that is affected. I'm doing research for such a case. – Doc Jul 26 '23 at 11:56