I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good stuff for further reading is Emmanuel Goossaert 6-part guide to coding for SSDs and the paper Don't Stack your Log on my Log [pdf].
SSDs write and read in whole pages only. Where the page size differs from SSD to SSD but is typically a multiple of 4kb. My Samsung EVO 840 uses an 8kb page size (which incidentally, Linus calls "unusable shit" in his usual colorful manner.) SSDs cannot modify data in-place, they can only write to free pages. So combining those two restrictions, updating a single byte on my EVO requires reading the 8kb page, changing the byte, and writing it to a new 8kb page and updating the FTL page mapping (a ssd data structure) so the logical address of that page as understood by the OS now points to the new physical page. Because the file data is also no longer contiguous in the same erase block (the smallest group of pages that can be erased) we are also building up a form of fragmentation debt that will cost us in future garbage collection in the SSD. Horribly inefficient.
As an asside, looking at my PC filesystem:
C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c:
It has a 512 byte sector size and a 4kb allocation (cluster) size. Neither of which map to the SSD page size - probably not very efficient.
There's some issues with just writing with e.g. pwrite()
to the kernel page cache and letting the OS handle writing things out. First off, you'll need to issue an additional sync_file_range()
call after calling pwrite()
to actually kick off the IO, otherwise it will all wait until you call fsync()
and unleash an IO storm. Secondly fsync()
seems to block future calls to write()
on the same file. Lastly you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g a couple hundred bytes.) It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(). Instead they rely on capacitors to write the cache out to the SSD in the event of a power loss. In that case, maybe the SSD does the right thing with an append only log being written sectors at a time - it may not write out the final partial page until the next append(s) arrives and completes it (or unless it is forced out of the cache due to large amounts of unrelated IOs.) Since the answer to that likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?