How to do atomic file write & fsync to stable storage on Linux?

Question

I'm writing a specialized database-like thingie targeting Linux. It needs to support atomic and durable modification.

I'm trying to implement this using write-ahead logging, but I'm having trouble figuring out how to atomically commit the log to stable storage.

Currently, I have a single byte in the log record header for the is_committed flag, which indicates that the record has been fully written to stable storage (to protect against failures mid-append). The commit procedure goes like this:

1. Append the complete log record with the is_committed byte set to 0.
2. fsync the log file.
3. Do a single-byte write setting the is_committed byte to 1.
4. fsync it again.
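The four steps above can be sketched in C roughly as follows. This is only an illustration under assumptions: the record layout (a 1-byte is_committed flag immediately followed by the payload) and the helper names pwrite_all and commit_record are hypothetical, not taken from the actual implementation.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf at offset off,
 * retrying on short writes and EINTR. Returns 0 or -1. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {           /* no progress; treat as an I/O error */
            errno = EIO;
            return -1;
        }
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Assumed layout at rec_off: [is_committed: 1 byte][payload: len bytes].
 * Returns 0 on success, otherwise the number (1-4) of the failed step. */
static int commit_record(int fd, off_t rec_off, const void *payload, size_t len)
{
    const uint8_t uncommitted = 0, committed = 1;

    assert(payload != NULL || len == 0);

    /* Step 1: append the complete record with is_committed == 0. */
    if (pwrite_all(fd, &uncommitted, 1, rec_off) != 0 ||
        pwrite_all(fd, payload, len, rec_off + 1) != 0)
        return 1;

    /* Step 2: flush the record to stable storage. */
    if (fsync(fd) != 0)
        return 2;

    /* Step 3: single-byte write setting is_committed to 1. */
    if (pwrite_all(fd, &committed, 1, rec_off) != 0)
        return 3;

    /* Step 4: flush the flag. */
    if (fsync(fd) != 0)
        return 4;

    return 0;
}
```

A return value of 4 from this sketch means the flag was written to the file cache but the final fsync reported an error.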

Since a single-byte write that modifies only one bit is surely atomic, if something fails at any point, stable storage will contain either the full and correct record or an is_committed byte of 0. So this appears to work as intended.

However, consider what happens if the second fsync reports an error for whatever reason. On one hand, this means that it was unable to propagate is_committed == 1 to stable storage. On the other hand, since the preceding write succeeded, is_committed == 1 remains in the file cache and can still be written out to stable storage by the OS at any point as usual.

So, while this procedure does provide durability and atomicity in the above sense, in this case there is no way to tell whether it has succeeded or failed. Therefore it appears a primitive is required that will perform steps 3 and 4 atomically.

pwritev2 with the RWF_SYNC flag, which can be used to do steps 3 and 4 in a single syscall, is a candidate for such a primitive. However, it's not clear whether this just saves a syscall or is actually atomic in the sense required.
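For reference, the combined form of steps 3 and 4 might look like the sketch below. The helper name set_committed_sync and the flag_off parameter are hypothetical. The sketch only shows the shape of the call; it does not by itself establish whether the write and sync halves of the operation can fail separately, which is the open question here.

```c
#define _GNU_SOURCE     /* pwritev2() and RWF_SYNC are Linux/glibc extensions */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: set the is_committed byte at flag_off to 1 with a
 * single pwritev2() call carrying RWF_SYNC (the per-write equivalent of
 * O_SYNC). Returns 0 if the call reported one byte written, else -1. */
static int set_committed_sync(int fd, off_t flag_off)
{
    uint8_t committed = 1;
    struct iovec iov = { .iov_base = &committed, .iov_len = 1 };
    ssize_t n;

    assert(flag_off >= 0);

    /* Retrying on EINTR is harmless here: rewriting the same byte
     * with the same value is idempotent. */
    do {
        n = pwritev2(fd, &iov, 1, flag_off, RWF_SYNC);
    } while (n < 0 && errno == EINTR);

    return n == 1 ? 0 : -1;
}
```

On failure, errno is left as set by pwritev2, but a -1 result here does not say whether is_committed == 1 reached the file cache.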

So, how does one do an atomic write and fsync to stable storage while being able to tell whether it has succeeded or failed on Linux? Or am I misunderstanding something and the commit procedure should be different?

Would opening the log file with O_SYNC solve this, as writes would effectively be as though fsync was called directly after? — Rafael, Jul 07 '22 at 03:55
@Rafael I don't know - it's the same issue as `pwritev2` with `RWF_SYNC`: it's not clear if `write` and `sync` parts of the composed operation can fail separately. — yuri kilochek, Jul 07 '22 at 08:04
From your own linked reference: "A write ahead log is an append-only [...]". One doesn't change something in place. We append to the log. But it's not clear what role your header bit plays in your implementation. It's not clear what you mean by "atomically commit the log to stable storage". Or exactly what you are doing that accomplishes write-ahead logging. PS Maybe you are asking about two-phase commit? Or if your question is really just about OS calls, how has that not been answered by previous Q&A? PS "surely" typically means "I'm not sure". PS Please clarify via edits, not comments. — philipxy, Jul 07 '22 at 08:11
@philipxy The `is_committed` flag indicates that the record is fully appended to the log. It is needed to detect failures mid-append and ignore such incomplete records when recovering. No, this is not a distributed system and has nothing to do with two-phase commit. This is a question about dealing with the outlined issue in general, but the right syscall might be the answer. I'm not sure what previous Q&A you refer to, none appear to consider this. I'm using "surely" to mean "I'm fairly certain, but allow for the possibility that I'm wrong" which is _surely_ correct ;). — yuri kilochek, Jul 07 '22 at 09:40
"not distributed"--a distributed system is one where things are not guaranteed to happen. Two-phase commit serves a certain purpose. "previous Q&A"--previously posted SO Q&A. — philipxy, Jul 07 '22 at 13:58


0 Answers