Our product holds a write-exclusive file handle on each opened document file so that no other process can write to it while we have it open.
Hence, Windows won't allow any other process to do more than read from the file, nor can the file be deleted from Explorer or another process - because of the open (and write-exclusive) handle on it.
However, we've been having trouble with very odd edge cases where the file's contents become corrupted. I think it comes down to bugs, or possibly my misunderstanding of what the Windows APIs guarantee. To save a design file over the top of a previous version, for which we currently hold a file handle, I have to rewind the handle to the start of the file, write the data out, and then force it to flush & truncate at the new position (in case the file shrank; we don't want extra sludge on the end of our file, since that too would be a form of corruption). This happens multiple times during a session: each time the user edits and then saves their changes.
Sometimes, though, our customers report corrupt files as a result of all this (only over the network, never locally).
We think that this might be due to the fact that our actual save process is slightly more complex:
- rewind the (already open) file handle
- write out the core data
- flush & truncate the handle
- fseek to the end of the file
- write out the thumbnail image data (essentially - append a thumbnail)
- flush & truncate the handle
It might just be a case of "don't flush, seek, write, flush": perhaps that sequence triggers subtle bugs in MS's networked filesystem code, or depends on behavior the system doesn't guarantee and so cannot be relied upon?
So, I'm implementing a two-layer fix:
1. doing a single rewind, then write core data + image data + flush & truncate (once)
2. doing a save-as-temp, close, rename
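To make the first layer concrete, here is a minimal sketch of a single-pass save, assuming the handle is a C `FILE*` opened for update. The function name `save_in_one_pass` and the in-memory `core`/`thumbnail` buffers are hypothetical, not our real code. It builds the whole payload first, writes it once from offset 0, flushes once, and truncates once.

```cpp
#include <cstdio>
#include <vector>

#ifdef _WIN32
#include <io.h>      // _chsize_s, _fileno
#else
#include <unistd.h>  // ftruncate
#endif

// Hypothetical helper: overwrite the already-open file `f` with
// core data followed by the thumbnail, then cut off any stale tail.
// Returns false on the first failure.
bool save_in_one_pass(std::FILE* f,
                      const std::vector<unsigned char>& core,
                      const std::vector<unsigned char>& thumbnail)
{
    // Build the full payload up front so the file sees one contiguous write.
    std::vector<unsigned char> payload(core);
    payload.insert(payload.end(), thumbnail.begin(), thumbnail.end());

    if (std::fseek(f, 0, SEEK_SET) != 0) return false;          // rewind once
    if (!payload.empty() &&
        std::fwrite(payload.data(), 1, payload.size(), f) != payload.size())
        return false;                                           // write once
    if (std::fflush(f) != 0) return false;                      // flush once

    // Truncate at the new end of data in case the file shrank.
#ifdef _WIN32
    return _chsize_s(_fileno(f), static_cast<long long>(payload.size())) == 0;
#else
    return ftruncate(fileno(f), static_cast<off_t>(payload.size())) == 0;
#endif
}
```

Note that `fflush` only pushes the C runtime's buffer to the OS; whether the data has reached the disk or the network share at that point is a separate question.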
No. 2 has some nice features - such as "if there is a problem writing out the new file, the old one remains untouched." That means at worst their new data isn't saved, but no old data is lost.
It's a basic use of the classical pattern of "build a new copy then swap it into the real / active data structure."
Great - but what I don't know is how to "swap contents of files"?
I can do the classical:
- Write T (temp) fully and close it.
- Rename A (actual) file to A.bak.
- Rename T to A
(and of course I'll need to delete any previous A.bak first).
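As a minimal sketch of those three steps, assuming portable C++17 `std::filesystem` rather than raw Win32 calls: the helper name `save_via_rename` and the `.tmp`/`.bak` suffixes are made up for illustration, and no handle is held on A here.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: write T, rename A to A.bak, rename T to A.
// Returns false if any step fails, trying to leave the old A in place.
bool save_via_rename(const fs::path& actual, const std::string& contents)
{
    fs::path temp = actual;   temp += ".tmp";   // assumed naming scheme
    fs::path backup = actual; backup += ".bak"; // assumed naming scheme
    std::error_code ec;

    // Write T fully and close it.
    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
        out.close();
        if (!out) { fs::remove(temp, ec); return false; }
    }

    // Delete any previous A.bak first.
    fs::remove(backup, ec);
    if (ec) { fs::remove(temp, ec); return false; }

    // Rename A to A.bak (skipped on the very first save, when A does not exist).
    const bool had_actual = fs::exists(actual, ec);
    if (had_actual) {
        fs::rename(actual, backup, ec);
        if (ec) { fs::remove(temp, ec); return false; }
    }

    // Rename T to A; on failure, try to put the old file back.
    fs::rename(temp, actual, ec);
    if (ec) {
        std::error_code ignored;
        if (had_actual) fs::rename(backup, actual, ignored);
        fs::remove(temp, ignored);
        return false;
    }
    return true;
}
```

Between the two renames there is a moment when no file named A exists, which is exactly the kind of window that worries me.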
This is fine - but again - we normally have a locked handle on A. So this expands to a somewhat imperfect:
1. Write T
2. Close our handle on A
3. Rename A to A.bak
4. Rename T to A
5. Acquire a write-exclusive handle on A
What I dislike about this is "too many moving parts."
- between 2 and 5, anyone else can grab a lock on A or otherwise get in our way.
You don't think it will happen - but then file system indexing or antivirus or backup software can all get in the way and very - very - very often do (in our experience).
So - ideally, I don't want to let go of control of A at any point! I want each hand-off to be impervious to antivirus or other software getting in there and messing things up.
Ideally, in fact, I'd:
- Write T
- Swap guts of T and A (ask the file system to actually link name A to contents of T)
- Live happily forever more...
So, is there a pattern that others have discovered for swapping T and A?
Is there a set of API calls to make this better / more robust?
Other thoughts entirely that might help rethink my approach?
NOTE: MS has deprecated the transactional filesystem API (TxF), so that sounds like a non-starter - not to mention that it isn't available on all filesystems under Windows anyway.
Update: FWIW, I implemented this as: write temp file, rename original, rename temp to real, delete original (plus the necessary unlock and re-lock), using RAII and ScopeGuard to handle rollback on failure. Of course the rollbacks, being side effects that depend on the OS, are best-effort and not as well guaranteed as the C++ language contracts themselves. Still, it was quite effective during testing and never gave me a bad file, even though I created a number of problems, intentionally and unintentionally, that produced a bad temp file or otherwise threw an exception partway through the algorithm and invoked the rollback.
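A rough sketch of the guard-based shape, not the actual production code: the `ScopeGuard` class below is a simplified stand-in for a real ScopeGuard library, `guarded_save` and the `.tmp`/`.old` suffixes are hypothetical, the original file is assumed to exist already, and the unlock/re-lock steps are left out.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Simplified stand-in for a ScopeGuard: runs `undo` on scope exit unless dismissed.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> undo) : undo_(std::move(undo)) {}
    ~ScopeGuard() { if (undo_) { try { undo_(); } catch (...) {} } }
    void dismiss() noexcept { undo_ = nullptr; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
    std::function<void()> undo_;
};

// Hypothetical save routine; throws on failure and rolls back on a best-effort basis.
void guarded_save(const fs::path& actual, const std::string& contents)
{
    fs::path temp = actual; temp += ".tmp";  // assumed naming scheme
    fs::path old = actual;  old += ".old";   // assumed naming scheme

    // Write the temp file; if anything later throws, remove it again.
    ScopeGuard remove_temp([&] { std::error_code ec; fs::remove(temp, ec); });
    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        out.exceptions(std::ios::failbit | std::ios::badbit);
        out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
        out.close();
    }

    // Rename original out of the way; if the next step throws, rename it back.
    fs::rename(actual, old);
    ScopeGuard restore_old([&] { std::error_code ec; fs::rename(old, actual, ec); });

    // Rename temp to real.
    fs::rename(temp, actual);

    // Success: discard the guards, then delete the old original.
    restore_old.dismiss();
    remove_temp.dismiss();
    std::error_code ec;
    fs::remove(old, ec);
}
```

The guards are declared in the order the steps happen, so on an exception they unwind in reverse: the original is put back first, then the temp file is removed.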
Update 2: "Final" algorithm is to
1. (save to a temporary local verify copy)
2. save to a temporary new file
3. (verify the new save and the verify match)
4. drop our lock on the real file
5. rename the real file to temporary old file and replace the original file with the temp file (this includes transferring attributes, ACLs, and timestamps - see ReplaceFile())
6. obtain our lock (if it was locked)
7. Success (discard our guards)
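Steps 3 and 5 might look roughly like the sketch below. The helper names `files_match` and `replace_with_backup` are hypothetical. On Windows the swap is a single `ReplaceFileW` call with no flags; the non-Windows branch is only a two-rename stand-in so the sketch compiles elsewhere, and it does not transfer attributes or ACLs. Locking (steps 4 and 6) is omitted.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <system_error>

#ifdef _WIN32
#include <windows.h>
#endif

namespace fs = std::filesystem;

// Step 3 (sketch): byte-for-byte comparison of the new save and the verify copy.
bool files_match(const fs::path& a, const fs::path& b)
{
    std::error_code ec;
    const auto size_a = fs::file_size(a, ec);
    if (ec) return false;
    const auto size_b = fs::file_size(b, ec);
    if (ec || size_a != size_b) return false;

    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    if (!fa || !fb) return false;
    return std::equal(std::istreambuf_iterator<char>(fa), std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(fb), std::istreambuf_iterator<char>());
}

// Step 5 (sketch): replace `actual` with `temp`, keeping the previous version as `old`.
bool replace_with_backup(const fs::path& actual, const fs::path& temp, const fs::path& old)
{
#ifdef _WIN32
    // fs::path::c_str() yields const wchar_t* on Windows.
    return ReplaceFileW(actual.c_str(), temp.c_str(), old.c_str(),
                        0, nullptr, nullptr) != 0;
#else
    // Stand-in for other platforms: two plain renames with a best-effort undo.
    std::error_code ec;
    fs::rename(actual, old, ec);
    if (ec) return false;
    fs::rename(temp, actual, ec);
    if (ec) { std::error_code ignored; fs::rename(old, actual, ignored); return false; }
    return true;
#endif
}
```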