I am working on a greenfield project for a “cloud-native” DBMS, where “cloud-native” means that the guarantees it makes (e.g. ACID) will depend on the presence of certain backing IaaS services (object storage, managed message queues, etc.). The goal is to reduce the codebase size and ops overhead of the DBMS in cases where you’re already going to be running in an IaaS environment anyway.
One feature any DBMS needs is a Write-Ahead Log (WAL) to replay state after a crash. The naive, “cloud-oblivious” way to implement a WAL is to just make it a file on disk that the DBMS daemon manages. In a cloud setting, this implicitly translates to the WAL living either on a locally-attached “ephemeral” disk, or on a SAN (e.g. EBS, GCE PD) volume attached to the VM’s hypervisor over something like iSCSI. (And, as WALs are for crash-recovery, we can ignore the ephemeral-disk option: if the crash was because the instance failed, the disk would be gone!)
WALs have particular semantics:
a WAL is owned by one process/job; nothing else will ever read or write it (i.e. it can be considered permanently “exclusively locked” by its owner)
the only writes are appends (which might be translated to overwrites in a ring-buffer file, but this is an implementation detail)
there is no mixed read/write traffic; the WAL is opened either only for reading or only for writing, with no switching occurring during a session
read sessions are rare (only for crash-recovery) and are always a streaming read of the entire WAL, starting from the first available (i.e. not garbage-collected) segment
the WAL’s writer can acknowledge that everything in the WAL up to a given checkpoint has been committed to its final destination, allowing everything before the checkpoint to be marked for garbage-collection or overwriting
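To make the semantics above concrete, here is a minimal in-memory sketch of the interface they imply. The names (`append`, `checkpoint`, `replay`) are my own, not from any real library, and a real implementation would of course persist segments durably rather than hold them in a Python list:

```python
# Minimal sketch of the WAL interface implied by the semantics above.
# Illustrative only: an in-memory stand-in for a durable segment store.
from dataclasses import dataclass, field


@dataclass
class WriteAheadLog:
    """Append-only log with checkpoint-based garbage collection."""
    _records: list = field(default_factory=list)
    _gc_before: int = 0  # records below this index are garbage-collected

    def append(self, record: bytes) -> int:
        """The only write operation: append a record, returning its index."""
        self._records.append(record)
        return len(self._records) - 1

    def checkpoint(self, index: int) -> None:
        """The writer acknowledges everything up to `index` is committed
        to its final destination; earlier records become GC-eligible."""
        self._gc_before = max(self._gc_before, index + 1)
        for i in range(min(self._gc_before, len(self._records))):
            self._records[i] = None  # eagerly drop GC'd records

    def replay(self):
        """Read session (crash recovery only): a streaming read of every
        record from the first non-garbage-collected one to the end."""
        for i in range(self._gc_before, len(self._records)):
            yield self._records[i]
```

Note how the interface has no random reads, no overwrites, and no mixed sessions: a process either appends and checkpoints, or replays from the start, which is exactly the access pattern a backing service would need to support.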
Given these semantics, I’m wondering if there is some other IaaS infrastructure-component that would be a better fit for handling WAL write and crash-recovery traffic—better than a SAN volume would.
By “better fit”, I mean some combination of these considerations:
the infra-component could be communicated with over a streaming protocol that matches WAL semantics more closely than a block-storage protocol like iSCSI does, decreasing the overhead on the instance;
the WAL, given its essential role in crash-recovery, would be less likely to be corrupted than on a single SAN volume;
the solution would be lower-cost per GB of WAL data written than the cost for the SAN volume.
(I probably can’t have all three, but two out of three would be nice.)
Two classes of infra-components that seem to work for this, but don’t really, are durable Message Queue services (AWS SQS; Google Cloud Pub/Sub) and object storage (S3, GCS).
Both of these service types allow quick writes of small messages/objects and will then durably persist/replicate them, and neither will persist a message that has been only partially written; but they’re far too costly for this use-case, even compared to a SAN disk. A WAL can have multiple TBs of data flowing through it per day, and both object stores and message queues have, essentially, a cost per checkpointed write, making WALs possibly the most expensive thing to store in them.
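A rough back-of-envelope shows why per-request pricing is so punishing here. The prices and the per-record PUT granularity below are my own assumptions (approximate list prices; check current pricing), not figures from the services themselves:

```python
# Back-of-envelope: per-request object-store pricing vs. a SAN volume
# for WAL traffic. All prices are assumed approximate list prices.

TB = 10**12  # decimal TB, for simplicity

wal_bytes_per_day = 2 * TB   # "multiple TBs per day"
record_size = 4 * 1024       # assume one PUT per ~4 KiB WAL record

# Object storage: pay per PUT request (assume ~$0.005 per 1,000 PUTs).
puts_per_day = wal_bytes_per_day // record_size
object_store_cost_per_day = puts_per_day * 0.005 / 1000

# SAN volume: pay per provisioned GB-month (assume ~$0.08/GB-month
# for a 2 TB volume sized to hold a day or two of WAL).
san_cost_per_day = 2000 * 0.08 / 30

print(f"object-store PUTs/day:  {puts_per_day:,}")
print(f"object-store cost/day:  ${object_store_cost_per_day:,.0f}")
print(f"SAN volume cost/day:    ${san_cost_per_day:.2f}")
```

Under these assumptions the object store comes out hundreds of times more expensive per day than the SAN volume; batching records into larger objects narrows the gap, but only at the cost of write latency before a record is durable.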