Here is an article on lwn.net that discusses at length the potential data loss that occurs when a program fails to sync its data adequately (i.e. the crash-consistency problem) on ext4 (the comments discussion is enlightening as well).
ext3 apparently achieves better crash consistency when using data=ordered, because it forces data to disk before the corresponding metadata changes are committed to the journal, and it uses a default commit interval of 5 seconds. ext4, in contrast, trades this off for performance: its delayed physical block allocation model lets uncommitted data continue living in the cache for some time. A quote from the article:
> The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3
So unwritten data can exist only in a volatile cache until it is forced to disk by a system-wide sync OR by the application's explicit fsync of its own data (as Jeffery has pointed out). If the application/client doesn't do this, we are more prone to data loss.
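To make the distinction concrete, here is a small shell sketch (paths are hypothetical): the first write may sit in the page cache for a while, the second asks dd to fsync(2) the file before exiting, and a system-wide sync flushes whatever is still dirty.

```shell
# Plain write: the data may live only in the volatile page cache for a while
dd if=/dev/zero of=/tmp/unsynced.dat bs=4k count=16 2>/dev/null

# conv=fsync makes dd call fsync(2) on the output file before exiting,
# forcing the data to stable storage (the application-side fix)
dd if=/dev/zero of=/tmp/synced.dat bs=4k count=16 conv=fsync 2>/dev/null

# A system-wide sync flushes all remaining dirty data, including unsynced.dat
sync
```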
One way of mitigating this issue is to mount the required filesystem with the sync option (refer to the "ext4 and data loss" discussion thread), and to do so we would have to mandate it in two places:
1. The mount into the pod
2. The OpenEBS storage pool OR the backend store

(In case of 1, we could have the target convert all writes to sync writes, as explained by Jeffery)
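For illustration, mandating the option on the backend store could look like the following fstab entry (the device path and mount point are placeholders, not the actual OpenEBS layout):

```
# /etc/fstab entry forcing synchronous writes on the backing ext4 volume
# (device and mount point are hypothetical placeholders)
/dev/sdb1  /var/openebs/pool0  ext4  defaults,sync  0  2
```

The equivalent one-off command would be `mount -o sync /dev/sdb1 /var/openebs/pool0` (requires root).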
While the mount(8) documentation specifically states that -o sync is only supported up to ext3 (among the ext family of filesystems), a manual filesystem mount with this option is accepted on ext4. To check whether this is something the mount call accepts but ext4 ignores, I ran a small fio-based random write performance test on a 256M data sample, once on a disk mounted with the sync option and once on the same disk without it. To ensure that the writes themselves were not SYNC writes, the libaio ioengine was selected with direct=1 and iodepth=4 (asynchronous, unbuffered I/O).
The results showed a difference of around 300+ IOPS (with the non-sync mount performing better, of course). This result suggests that the sync mount flag does play a role on ext4, but I'm still looking for stronger evidence.
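For reference, a fio job file roughly matching the test described above (the block size and filename are assumptions, not stated parameters of my run):

```
; Approximate reproduction of the random write test above
[randwrite-256m]
ioengine=libaio                 ; asynchronous I/O
direct=1                        ; unbuffered: bypass the page cache on the app side
iodepth=4
rw=randwrite
bs=4k                           ; assumed block size
size=256m
filename=/mnt/testdisk/fio.dat  ; hypothetical path on the disk under test
```

Running the same job against the sync-mounted and the normally-mounted disk isolates the effect of the mount flag from the application's own write behavior.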