I'm writing a specialized database-like thingie targeting Linux. It needs to support atomic and durable modification.
I'm trying to implement this using write-ahead logging, but I'm having trouble figuring out how to atomically commit the log to stable storage.
Currently, within the log record header I have a single byte for the `is_committed` flag, which indicates that the record is fully written to stable storage (to protect against failures mid-append), and the commit procedure goes like this:
1. Append the complete log record with the `is_committed` byte set to `0`.
2. `fsync` the log file.
3. Do a single-byte write setting the `is_committed` byte to `1`.
4. `fsync` it again.
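
For concreteness, here is a minimal C sketch of the commit procedure above. The record layout (one `is_committed` byte followed directly by the payload) and the names `write_all`, `wal_commit` and the `WAL_*` return codes are made up for illustration; my real header has more fields.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return codes for this sketch. */
enum { WAL_OK = 0, WAL_FAILED = -1, WAL_UNKNOWN = -2 };

/* Write exactly len bytes at offset off, retrying short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1; /* no progress; treat as failure */
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Assumed layout: [is_committed: 1 byte][payload: len bytes] at offset off. */
int wal_commit(int fd, off_t off, const void *payload, size_t len)
{
    const uint8_t zero = 0, one = 1;

    /* Step 1: append the record with is_committed = 0. */
    if (write_all(fd, &zero, 1, off) < 0)
        return WAL_FAILED;
    if (write_all(fd, payload, len, off + 1) < 0)
        return WAL_FAILED;

    /* Step 2: make the uncommitted record durable. */
    if (fsync(fd) < 0)
        return WAL_FAILED;

    /* Step 3: flip the flag with a single-byte write. */
    if (write_all(fd, &one, 1, off) < 0)
        return WAL_FAILED;

    /* Step 4: if this fsync fails, the flag is still in the page cache,
     * so the caller cannot tell whether the commit will reach disk. */
    if (fsync(fd) < 0)
        return WAL_UNKNOWN;

    return WAL_OK;
}
```

The `WAL_UNKNOWN` result is the case I don't know how to handle.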
Since a single-byte write that modifies only one bit is surely atomic, if something fails at any point the stable storage will either contain the full and correct record or the `is_committed` byte will be `0`. So this appears to work as intended.
However, consider what happens if the second `fsync` reports an error for whatever reason. On one hand, this means that it was unable to propagate `is_committed == 1` to stable storage. On the other hand, since the preceding write succeeded, `is_committed == 1` remains in the file cache, and can still be written out to stable storage by the OS at any point as usual.
So, while this procedure does provide durability and atomicity in the above sense, in this case there is no way to tell whether it has succeeded or failed. Therefore it appears a primitive is required that will perform steps 3 and 4 atomically.
`pwritev2` with the `RWF_SYNC` flag, which can be used to do steps 3 and 4 in a single syscall, is a candidate for such a primitive. However, it's not clear whether this just saves a syscall or is actually atomic in the sense required.
So, how does one do an atomic write and fsync to stable storage while being able to tell whether it has succeeded or failed on Linux? Or am I misunderstanding something and the commit procedure should be different?