The problem.

I have a Java daemon that, in essence, simply copies files from one directory into another. In some setups those directories are network shares provided by a Windows Server 2016 machine. Occasionally copying fails because Java throws an IOException while canonicalizing paths, and I'm fairly sure this happens because FindFirstFileW is used during that operation.

If a file is not found, FindFirstFileW simply fails and sets ERROR_FILE_NOT_FOUND as the last error. Java, OTOH, is prepared to get that error, and a few specific others, and simply ignores them. What is NOT part of that list of ignored errors is ERROR_NO_MORE_FILES:

if ((errval == ERROR_FILE_NOT_FOUND)
    || (errval == ERROR_DIRECTORY)
    || (errval == ERROR_PATH_NOT_FOUND)
    || (errval == ERROR_BAD_NETPATH)
    || (errval == ERROR_BAD_NET_NAME)
    || (errval == ERROR_ACCESS_DENIED)
    || (errval == ERROR_NETWORK_UNREACHABLE)
    || (errval == ERROR_NETWORK_ACCESS_DENIED)) {
    return 0;
}
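
To make the gap concrete, the check above can be modeled in a few lines of Java. This is only an illustrative sketch and not JDK code: the class and method names are made up, and the numeric values are the Win32 error codes from winerror.h.

```java
import java.util.Set;

// Illustrative model of the ignore list quoted above; not part of the JDK.
class IgnoredLastErrors {
    // Win32 error codes as defined in winerror.h.
    static final int ERROR_FILE_NOT_FOUND = 2;
    static final int ERROR_PATH_NOT_FOUND = 3;
    static final int ERROR_ACCESS_DENIED = 5;
    static final int ERROR_NO_MORE_FILES = 18;
    static final int ERROR_BAD_NETPATH = 53;
    static final int ERROR_NETWORK_ACCESS_DENIED = 65;
    static final int ERROR_BAD_NET_NAME = 67;
    static final int ERROR_DIRECTORY = 267;
    static final int ERROR_NETWORK_UNREACHABLE = 1231;

    // The eight codes the native check treats as "not an error".
    static final Set<Integer> IGNORED = Set.of(
            ERROR_FILE_NOT_FOUND,
            ERROR_DIRECTORY,
            ERROR_PATH_NOT_FOUND,
            ERROR_BAD_NETPATH,
            ERROR_BAD_NET_NAME,
            ERROR_ACCESS_DENIED,
            ERROR_NETWORK_UNREACHABLE,
            ERROR_NETWORK_ACCESS_DENIED);

    // Mirrors the condition of the if statement: true means "return 0".
    static boolean isIgnored(int errval) {
        return IGNORED.contains(errval);
    }

    public static void main(String[] args) {
        System.out.println("ERROR_FILE_NOT_FOUND ignored: " + isIgnored(ERROR_FILE_NOT_FOUND));
        System.out.println("ERROR_NO_MORE_FILES ignored: " + isIgnored(ERROR_NO_MORE_FILES));
    }
}
```

In this model ERROR_FILE_NOT_FOUND is ignored while ERROR_NO_MORE_FILES is not, which matches the failure I'm seeing.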

But for some reason Windows sometimes decides to set exactly that error as the last one, which can be seen using Process Monitor, and that error fits the error message Java provides in its stack trace perfectly.

10:12:06,6244515        integration.exe 6928    QueryDirectory  \\HOST\SHARE$\DocBeam3\[...].zip  NO MORE FILES   Filter: 20191106-081920-[...].zip

vs.

19:08:03,7485947    java.exe    6232    QueryDirectory  C:\Users\[...].zip  NO SUCH FILE    Filter: 20191022-143101-[...].zip

Additional observations.

The interesting thing is that the daemon doesn't fail on each and every file copy, but only occasionally and rather rarely. When it does fail, it seems to be related to other directories and files already being present in the target directory. Those are completely unrelated to the daemon and, according to ProcMon, they don't get iterated or otherwise touched, yet their mere existence seems to make a difference. If I delete all of those files and directories, emptying the target directory, copying instantly succeeds again. That's interesting because having files and directories in the target directory in my local setup doesn't seem to have any influence: copying never fails, and the event logged by ProcMon is NEVER ERROR_NO_MORE_FILES. After emptying the directory on the setup where the problem happens, ProcMon logs ERROR_FILE_NOT_FOUND again as well.

The question.

So it seems that, under some currently unknown circumstances, Windows decides to set ERROR_NO_MORE_FILES as the last error in the calls to FindFirstFileW made by wcanonicalize. Because Java doesn't have that code on its list of ignored errors, copying fails in those circumstances, even though it seems to be a perfectly valid situation; I don't see any real error otherwise. Additionally, the mere existence of a list with different error codes already shows that FindFirstFileW is known to set different error codes under different circumstances.

So, when does FindFirstFileW set the last error to ERROR_NO_MORE_FILES instead of ERROR_FILE_NOT_FOUND?

I'm hoping that knowing this will help me better understand what might trigger the problem in the first place.
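
Until the trigger is understood, one possible stopgap on the Java side would be to catch the IOException and fall back to purely lexical normalization. This is a minimal sketch under the assumption that the daemon only needs a normalized absolute path and not a truly canonical one; the class and method names are hypothetical and not taken from the daemon.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical helper; assumes a normalized absolute path is good enough
// whenever canonicalization fails.
class CanonicalFallback {
    static Path resolve(File file) {
        try {
            // Regular route: this is where the IOException surfaces.
            return file.getCanonicalFile().toPath();
        } catch (IOException e) {
            // Fallback: lexical normalization only, without touching the
            // file system. Links and letter case are NOT resolved here.
            return file.toPath().toAbsolutePath().normalize();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(new File("some/../target.zip")));
    }
}
```

This does not explain the error code and hides every other IOException from canonicalization as well, so it only makes sense as a temporary measure.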

Thorsten Schöning
  • 1
    Set a conditional breakpoint for when `FindFirstFileW` returns `ERROR_NO_MORE_FILES`, or change the source code if you can, to catch this situation. But by design this must not happen: if there are no matching files on the initial query, file systems return `STATUS_NO_SUCH_FILE` (converted to `ERROR_FILE_NOT_FOUND` by Win32), and `STATUS_NO_MORE_FILES` only on a subsequent query - https://github.com/microsoft/Windows-driver-samples/blob/master/filesys/fastfat/dirctrl.c#L801 – RbMm Nov 12 '19 at 20:50
  • 1
    @RbMm, in particular, see [\[MS-FSA\] 2.1.5.5.3 Directory Information Queries](https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fsa/fa8194e0-53ec-413b-8315-e8fa85396fd8). The protocol explicitly specifies that `STATUS_NO_SUCH_FILE` must be returned if there are no matching files for the first query. A file system or redirector is buggy if it returns `STATUS_NO_MORE_FILES` in this case. – Eryk Sun Nov 13 '19 at 08:01
  • @ErykSun Is the distinction between first and subsequent lookups associated with the handle used for `FindFirstFileW`, or managed otherwise? Java itself is most likely not reusing that handle, so in theory it shouldn't ever get anything other than `STATUS_NO_SUCH_FILE`. https://github.com/openjdk/jdk/blob/master/src/java.base/windows/native/libjava/canonicalize_md.c#L238 Nevertheless, it already maintains a list of different error codes. – Thorsten Schöning Nov 13 '19 at 09:41
  • @ErykSun The lookup done by Java is client-side for a server-side resource using SMB. If `ERROR_NO_MORE_FILES` is correct because something used the handle already for some former lookup Java isn't aware of, did that happen most likely client- or server-side? Am thinking of AV-software running on client and server, even though I didn't see anything obvious in ProcMon. – Thorsten Schöning Nov 13 '19 at 09:43
  • 2
    `FindFirstFileW` encapsulates the native calls `NtOpenFile`, which returns a handle for a new File object that references the directory, and `NtQueryDirectoryFile` with the `ReturnSingleEntry` parameter as true, so it only returns the first result. The query state is associated with the File object. In the Windows API, the handle for this kernel object is encapsulated by a 'search handle'. The latter is actually a record that contains the real handle and a buffer that's used to read several entries per `NtQueryDirectoryFile` system call in subsequent `FindNextFileW` calls. – Eryk Sun Nov 13 '19 at 11:51
  • 1
    Note that both `NtOpenFile` and `NtCreateFile` create a new File object via `IoCreateFileEx`, for which `Io` is the kernel prefix for I/O manager routines. The I/O manager in turn implements a parse routine for Device objects (e.g. "\Device\Mup"). This parse routine opens a Device object as a new File object. If it's a file system, or is mounted by one, then the file system parses the remaining path and associates the object with file-system structures. Typically this is a shared file control block (FCB) and a per-File context control block (CCB). State for a directory query is in the CCB. – Eryk Sun Nov 13 '19 at 12:13
  • 2
    Thus when `FindFirstFileW` calls `NtOpenFile`, it gets a handle for a File object that's associated with a CCB that was created just for that File object. So there's no chance that the directory query state can be conflated with a File object that was created by an unrelated `FindFirstFileW` call. The directory queries are completely separate. – Eryk Sun Nov 13 '19 at 12:18
  • @ErykSun I didn't mean unrelated queries to different paths, but something sitting between `FindFirstFileW` and `NtOpenFile`. The latter returns a handle to that "something", which does something with the handle, changing its state this way, and afterwards forwards the handle to `FindFirstFileW` to do its thing. "Something" like AV software. Would that be possible? – Thorsten Schöning Nov 13 '19 at 12:27
  • https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-November/063437.html – Thorsten Schöning Nov 15 '19 at 08:01
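
The behavior described in the comments above (`STATUS_NO_SUCH_FILE` for a first query without matches, `STATUS_NO_MORE_FILES` only for subsequent queries, with the state kept per File object) can be illustrated with a toy model. This is a simplified sketch in Java and not the actual file-system logic: all names are hypothetical, and the filter is an exact, case-insensitive name match rather than a wildcard pattern.

```java
import java.util.List;

// Toy model of per-File-object directory query state; not real Windows code.
class DirectoryQueryModel {
    enum Status { SUCCESS, NO_SUCH_FILE, NO_MORE_FILES }

    private final List<String> names;    // entries of the modeled directory
    private final String filter;         // name being searched for
    private int next = 0;                // scan position, private to this object
    private boolean initialQuery = true; // distinguishes first from later queries

    DirectoryQueryModel(List<String> names, String filter) {
        this.names = names;
        this.filter = filter;
    }

    // Returns one match per call, like a query with ReturnSingleEntry set.
    Status query() {
        boolean first = initialQuery;
        initialQuery = false;
        while (next < names.size()) {
            if (names.get(next++).equalsIgnoreCase(filter)) {
                return Status.SUCCESS;
            }
        }
        // No (further) match: the status depends on whether this was the first query.
        return first ? Status.NO_SUCH_FILE : Status.NO_MORE_FILES;
    }

    public static void main(String[] args) {
        List<String> dir = List.of("a.zip", "b.zip");
        DirectoryQueryModel missing = new DirectoryQueryModel(dir, "c.zip");
        System.out.println(missing.query()); // first query, no match
        DirectoryQueryModel present = new DirectoryQueryModel(dir, "a.zip");
        System.out.println(present.query()); // first query, match
        System.out.println(present.query()); // second query, exhausted
    }
}
```

In this model a fresh query object can only ever report NO_SUCH_FILE for a missing name; NO_MORE_FILES requires that an earlier query already ran against the same object, which is exactly what should not be possible for a single `FindFirstFileW` call.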

0 Answers