I'm using a network of Linux (Debian Squeeze on kernel 2.6.32) machines, sharing files using NFS (v3). The scenario is that a process running on client A will create a file through NFS on file server Z. Then after the process is finished running on A (flushing its output and closing the file), client B will try to access the file. 99.9% of the time there's no problem with this approach.
The problem is that, very rarely, client B gets a "file does not exist" error when it attempts to read the file. The wrinkle is that B always shows the file when an "ls" or readdir is made on the containing directory. However, trying to open the file, or even calling "stat" on it, fails with the "does not exist" error.
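To make the symptom concrete, here is a minimal sketch of a check that could be run on client B. It lists a directory and reports every name that the listing returns but "stat" rejects as nonexistent. The function name find_ghost_entries is made up for illustration, and the sketch assumes the error surfaces as ENOENT.

```python
import errno
import os


def find_ghost_entries(directory):
    """Return names that listdir() reports but stat() rejects with ENOENT.

    find_ghost_entries is a hypothetical helper name, not part of any library.
    """
    ghosts = []
    for name in os.listdir(directory):       # readdir on the containing directory
        path = os.path.join(directory, name)
        try:
            os.stat(path)                    # the call that fails in the scenario above
        except OSError as exc:
            if exc.errno == errno.ENOENT:    # assumed error code for "does not exist"
                ghosts.append(name)
            else:
                raise                        # anything else is a different problem
    return sorted(ghosts)
```

Calling find_ghost_entries on the NFS-mounted directory should return an empty list on a healthy client. Note that a dangling symlink is also reported, since "stat" follows symlinks.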
Some additional points:
- The files are only written once on a single client, but can be read many times by many different clients (WORM). The files are also never deleted in the process.
- When the errors show up, they only appear to affect some of the clients (randomly). Other clients can open and read the files without a problem. Furthermore, when the problem occurs, it tends to occur repeatedly. Rebooting the file server and re-mounting on the clients seems to eliminate the problem.
- The file becomes readable after enough time, anywhere from a few seconds to ten minutes. Sometimes the error goes away immediately after a readdir on the containing directory; sometimes it does not.
- I initially suspected an NFS attribute cache coherence issue, so I remounted with the noac option enabled. The problem continued to pop up (in addition to everything being grindingly slow).
- The problem only appears during heavy NFS traffic when a lot of large files are being created, written and read.
- Nothing indicating a problem appears in any of the syslogs or dmesg on either the client or server side.
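The recovery delay described in the points above can be measured with a simple polling loop. The following is only a sketch: wait_until_visible and its parameters are hypothetical names, the default timeout of 600 seconds mirrors the ten-minute upper bound I have observed, and the optional relist flag re-reads the containing directory before each attempt, since that sometimes (but not always) clears the error.

```python
import errno
import os
import time


def wait_until_visible(path, timeout=600.0, interval=0.5, relist=False):
    """Poll stat() until path no longer fails with ENOENT.

    Returns the number of seconds waited, or None if the timeout elapsed.
    All names and defaults here are illustrative assumptions.
    """
    start = time.monotonic()
    while True:
        if relist:
            # Re-read the containing directory before each attempt.
            os.listdir(os.path.dirname(path) or ".")
        try:
            os.stat(path)
            return time.monotonic() - start
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                raise                        # not the "does not exist" symptom
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Running this once with relist=False and once with relist=True on an affected client would show whether re-reading the directory changes the recovery time.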
I strongly suspect this is an NFS cache coherence issue of some type. But I can't figure out what the exact cause or possible solution might be. Unless I'm misunderstanding the NFS manual, this type of behavior should be precluded by close-to-open cache coherence. Has anyone else had experience with this problem of NFS files that exist to the "readdir" system call, but don't exist to the "stat" system call? Any insight would be greatly appreciated. Thanks.