6

I know very little about DFS filesystems but have come across an issue with one of our deployments.

Our application writes files to a designated location, closes them, and then writes a record into the database. Another part of the application picks up these DB records and reads the file that was previously written.

In some cases the reader gets a "file not found" error and fails. Restarting it without touching anything else, it finds the file correctly and everything is fine.

I believe I have ruled out a problem with our application as the file is definitely flushed/closed before the database record is created.

Therefore I'm led to believe that the OS or filesystem is delaying the file write internally so it isn't immediately available.

The filesystem in question is Windows 2003 SP2 DFS. Is this a likely scenario with this DFS? If so is it possible to switch it into some sort of write-through/no caching policy to ensure the files are written promptly?
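To make the sequence concrete, here is a minimal Python sketch of the pattern described above — this is not our actual code; the sqlite table and paths are illustrative assumptions:

```python
import os
import sqlite3

def write_file_then_record(db, directory, name, data):
    """Writer side: write the file, flush and close it, THEN insert the DB record."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # file is flushed before the record is created
    db.execute("INSERT INTO pending(path) VALUES (?)", (path,))
    db.commit()

def read_pending(db):
    """Reader side: pick up DB records and read the files they point to."""
    for (path,) in db.execute("SELECT path FROM pending"):
        # On the failing deployment this open() sometimes raises
        # FileNotFoundError even though the DB record exists.
        with open(path, "rb") as f:
            yield path, f.read()
```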

Mike Q
    What is the client OS? Are there multiple replicas configured in the DFS? – Harry Johnston Sep 27 '11 at 03:39
  • @Harry, client OS is Windows 2008 R2, multiple replicas - not sure but will find out shortly – Mike Q Sep 27 '11 at 09:00
  • The reason I ask about the client OS is that we've experienced some similar issues which I think indicate a bug in file sharing which may have been introduced with Windows Vista or Windows 7. I'm not sure yet whether the problem is at the client or server end. – Harry Johnston Sep 28 '11 at 20:55

2 Answers

4

DFS is Distributed File System, which is exactly what the name says: a "virtual" file share that is distributed and replicated across multiple servers. Every time your application writes to it, it's actually accessing one of the copies on one of the member servers; if another application tries to read the same data soon afterwards, it could very well be accessing another server, which hasn't received the updated data yet.

With DFS, you can never be absolutely sure that data written to it will be available on a subsequent read: there can always be replication latency. You also have no way to tell your application to "talk" to a specific DFS server: it is free to connect to any one of the servers hosting the share.

If you want this application to work in real time, you should use a standard file share, not a DFS.

Massimo
  • More info. The site is apparently using DFS in a non-replicated fashion. e.g. dfs/mydir1 on server1, dfs/mydir2 on server2. If they are not using replication does the above still hold true? Can there still be a delay in the filesystem when writing? – Mike Q Sep 28 '11 at 07:36
  • Are the first and second applications running on the same server, or on different servers? – Massimo Sep 28 '11 at 08:09
  • The file is being written to and read by the same application/process on a single machine. – Mike Q Sep 28 '11 at 08:18
  • Ok, this is quite strange then. Even if DFS replication and/or caching was involved, this shouldn't happen. Are you absolutely sure the database record is created *after* the file has been written and flushed? – Massimo Sep 28 '11 at 09:21
  • 99% sure yes. We have not observed this on any other deployments which use exactly the same setup except for the filesystem difference. Btw are you saying that even with DFS replication if you access the file from the same machine/process then you should always see an up to date file? – Mike Q Sep 28 '11 at 10:59
  • DFS is only a service which replicates data across multiple servers; it's not a different file system. So, yes, if you are writing and then reading on the same server, you *really* should see up-to-date files. – Massimo Sep 28 '11 at 11:54
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/1459/discussion-between-mike-q-and-massimo) – Mike Q Sep 28 '11 at 12:40
-2

You are making the common, but incorrect, assumption that there is one universal notion of "after" and that if you do one thing "after" another, you are guaranteed to see the effects. This is simply a false notion, and nothing you can do will ever make it work the way you expect.

The analogy would be sending someone a letter, getting back a return receipt, calling the person on the phone and assuming they must have read the letter.

As you mentioned, delayed writes will screw this up. Many other things can screw it up too. Trying to find every possible way it can break and fix them all is just crazy.

Instead, if you need ordering between operations, use something that is specifically guaranteed to provide the particular ordering you need. Since there is no guaranteed ordering between the filesystem and the database, that combination won't do.

Most filesystems do provide guaranteed ordering with respect to themselves and their own operations when accessed through processes running on the same operating system instance. So after the file is correctly set up, you can create a 'trigger' file in the same filesystem. If the reader sees the trigger file, then it can know that the data file is complete and valid. It can remove the trigger file when it's done.
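A minimal sketch of that trigger-file idea (Python; the names are illustrative, and it assumes writer and reader share the same local filesystem):

```python
import os

def write_with_trigger(directory, name, data):
    """Writer: fully write and flush the data file, then create the trigger."""
    data_path = os.path.join(directory, name)
    with open(data_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # The trigger is created only after the data file is complete; within a
    # single filesystem this ordering is visible to other local processes.
    open(data_path + ".done", "w").close()

def try_read(directory, name):
    """Reader: only trust the data file once the trigger exists."""
    data_path = os.path.join(directory, name)
    trigger_path = data_path + ".done"
    if not os.path.exists(trigger_path):
        return None  # data file may not be complete yet
    with open(data_path, "rb") as f:
        payload = f.read()
    os.remove(trigger_path)  # reader removes the trigger when done
    return payload
```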

David Schwartz
  • This is not true. Of course the O.S. (or the controller) might be caching disk writes, but any decent caching algorithm will make sure to return the most up-to-date result on subsequent reads, even if that data only resides in the buffer cache and hasn't been physically written to the disk yet. – Massimo Sep 26 '11 at 22:07
  • @Massimo: That's exactly why I said, "Most filesystems do provide guaranteed ordering with respect to themselves and their own operations when accessed through processes running on the same operating system instance." What do you think I said that is not true? – David Schwartz Sep 27 '11 at 01:41
  • You said "there is no guaranteed ordering between the filesystem and the database"; this is not true, and if the client application was reading from the *same* filesystem, it would indeed find the data it was looking for. What's screwing up things here is the fact that he's (unknowingly?) using a distributed and replicated filesystem. This is like saying "it's normal to create an Active Directory account and not see it for a while"; no, this is not normal *per se*: it can happen only if you look for it on a different DC and in a different AD site, before replication takes place. – Massimo Sep 27 '11 at 05:57
  • Why is that not true? What standard, documentation, or argument provides the guarantee that there's ordering between the filesystem and the database? And the second part of your comment is really, really bad advice. Ordering simply isn't guaranteed. If it was, what he was doing would already work. Trying to synthesize a guarantee by avoiding everything you suspect might cause your method to break is suicide. The environment can always find some other way to break what you did. That's why standards provide guarantees -- so you don't have to guess and hope. – David Schwartz Sep 27 '11 at 06:00
  • This is not a problem of "ordering between the filesystem and the database"; he's not doing both things at the same time. He's doing a filesystem write, *closing and flushing it*, and *then* he's doing a database write. The filesystem write is already done when he accesses the DB. The problem is, since this is a *distributed* filesystem, successfully completing a write operation is not enough, as *then* the data needs to get replicated to other servers, and *this* can take a while. – Massimo Sep 27 '11 at 06:21
  • Which is also true for my AD example, or any other example which involves distributed applications without a single, central database; even a Facebook wall post can behave this way: you see it immediately after posting it, other people may have to wait some seconds or even minutes. – Massimo Sep 27 '11 at 06:22
  • @Massimo: It is a problem of ordering. I agree, he's not doing both things at the same time. **He's** doing one, then the other. And as I clearly pointed out in my answer, he's assuming that this ordering has global significance. The particular way it burned him was that this was a distributed filesystem. But the point is to do things that are **guaranteed** to work. Not to find every way something can break and then fix it. (And I agree, it's a problem with lots of things. You cannot assume ordering between X and Y unless something guarantees it.) – David Schwartz Sep 27 '11 at 06:49
  • @Massimo: I am baffled that you are standing up for an elementary programming error. When you don't have guaranteed ordering, it doesn't even mean anything to say "X happened after Y". That only makes sense if global ordering is guaranteed. He has no such guarantee between his filesystem and his database. So it means nothing to say the database operation occurred "after" the file write. There is no global ordering that includes both the database and the filesystem. They are each free to delay or reorder their operations and guarantee only internal consistency with their own responses. – David Schwartz Sep 27 '11 at 06:51
  • while I agree it's difficult to have a "global" ordering when more than one system is involved, I strongly disagree with your idea that there is no global ordering *on a single system*. If you perform a filesystem write and it succeeds, regardless of whatever caching system may be involved, the O.S. *guarantees* that any other process running on the system (including the file sharing service!) will read the most current data; otherwise, there would be no point at all in having all disk read/writes going through the O.S. kernel and storage subsystem. – Massimo Sep 27 '11 at 07:33
  • @Massimo: But he is relying on ordering between his filesystem and his database. He performs a filesystem operation and then a database operation and he assumes that anything that can see the database operation can see the filesystem operation. This requires the database operation to come after the filesystem operation in a global ordering that includes the filesystem and the database. No such ordering exists. This is an elementary programming error and I'm baffled that you continue to defend it. There is no global sequence point between the two operations, they are freely reorderable. – David Schwartz Sep 27 '11 at 07:39
  • I really don't understand why you are defending this position. As soon as the filesystem write operation is completed, IT IS COMPLETED, and the O.S. will **guarantee** that any subsequent filesystem read will be able to see the same data. Even if the *actual* disk write operation eventually happens *after* the DB access (due e.g. to caching), the O.S. will make sure everyone reads up-to-date data from the filesystem. As such, there actually **is** a logical ordering between operations, even if a physical one isn't actually enforced. – Massimo Sep 27 '11 at 09:41
  • @Massimo: That's not the issue. The filesystem is ordered with respect to itself, yes. The issue is the ordering of the filesystem with respect to the database. His false assumption is that if he does a filesystem operation and then a database operation, anyone who sees the database operation must see the filesystem operation. That is the error. My suggested change was to rely only on the filesystem being ordered with respect to itself. As you say, that is guaranteed. – David Schwartz Sep 27 '11 at 16:41
  • The filesystem is ordered. So it means something to talk about one filesystem operation occurring before another. But there is no global order that includes both the filesystem and the database. So he cannot say his first filesystem operation takes place before the database operation in some kind of global ordering because no such global ordering exists. If there was a global ordering, and the database operation took place before the filesystem in it, his method would clearly fail. So it must also fail if no ordering exists. – David Schwartz Sep 27 '11 at 16:44
  • He is doing a database operation after a filesystem operation *has been completed*, i.e. after the system call for writing to the filesystem returned "success". Regardless of what can happen to the database, after the O.S. kernel returns "ok, this has been saved" to its caller, anyone and anything reading from the filesystem will see up-to-date data... because it has to go through that exact same kernel in order to read it. Even if the data is only in the O.S.'s cache and not actually on disk, the O.S. will take care of handling this and never give out out-of-date things. – Massimo Sep 27 '11 at 18:00
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/1453/discussion-between-david-schwartz-and-massimo) – David Schwartz Sep 27 '11 at 22:54