
I'm working on a real-time application with the following characteristics:

  • Hundreds of clients will be inserting rows/documents concurrently, each writing a row every couple of seconds.
  • Largely append-only; almost all rows/documents, once inserted, are never changed.
  • A client should only see success when data has been flushed to disk, and thereafter read-your-writes consistency should hold.
  • Clients are willing to wait on the order of seconds for confirmation - long enough for multiple disk seeks and writes to happen.
  • There is too much data to fit in RAM (ruling out options like Redis). But rows written long ago are rarely accessed, so it's acceptable for them not to be in memory.
  • Ideally, these writes should not block reads.
  • A key-value store is fine, but there needs to be at least one reliable auto-incrementing index.

In other words (and tl;dr), the clients can tolerate latency, but they need a lot of trustworthy write throughput - more throughput than "one write is one disk operation."

I'm envisioning a database that would be implemented something like this: accept a number of TCP connections (theoretically limited only by available file descriptors), buffer those writes in memory, log batches of them to disk as often as possible (along with the updates to the auto-incrementing index), and respond on each TCP connection only when the associated disk write has completed. Or it could be as simple as a lazily-writing database that publishes a message when it has done a disk write (clients wait for the lazy response, then wait for the write message before reporting success).
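
Roughly, here's a minimal sketch of that batching idea in Python (purely illustrative - the names and structure are mine, not any particular database's API): many writers enqueue records, one log thread writes a whole batch, fsyncs once, and only then lets each contributing writer report success.

```python
# Minimal group-commit sketch (illustrative only, not production code).
import os
import queue
import threading

# Each queued item is (record bytes, event to signal once the record is on disk).
log_queue: "queue.Queue[tuple[bytes, threading.Event]]" = queue.Queue()

def log_writer(path: str) -> None:
    with open(path, "ab") as log:
        while True:
            batch = [log_queue.get()]          # block for the first record...
            while True:                        # ...then drain whatever else is queued
                try:
                    batch.append(log_queue.get_nowait())
                except queue.Empty:
                    break
            for record, _ in batch:
                log.write(record + b"\n")
            log.flush()
            os.fsync(log.fileno())             # one disk sync covers the whole batch
            for _, done in batch:
                done.set()                     # only now is it safe to ack each client

def client_write(record: bytes) -> None:
    done = threading.Event()
    log_queue.put((record, done))
    done.wait()                                # "success" only after the shared fsync

threading.Thread(target=log_writer, args=("append.log",), daemon=True).start()
client_write(b"hello")                         # returns once the batch has hit disk
```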

I think that with such high latency tolerance, this isn't asking too much. And I'd imagine that others have had this problem, such as financial companies that can't afford to lose data, but can afford to delay the response to any one client.

Do any battle-tested database solutions like Postgres, CouchDB/Couchbase, or MongoDB support modes of operation like this?

btown

1 Answer


PostgreSQL should fit this workload well; pretty much everything you've specified is within its normal feature set. Pg is ACID-compliant, supports group commit to reduce sync overhead, writers don't block readers, and it uses the OS for caching, so it'll naturally tend to keep only the hot data set in memory.

"Clients are willing to wait on the order of seconds for confirmation - long enough for multiple disk seeks and writes to happen"

If you go with PostgreSQL, your application is ideally suited to a really big commit_delay, which will help immensely with write throughput. You can't use synchronous_commit = off, because you need confirmation of the commit before replying, but you can hold queued commits briefly so that groups of them share a single flush, saving on sync costs.
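
For example, the relevant postgresql.conf knobs look something like this (the parameter names are real; the values are only illustrative starting points - benchmark before adopting anything):

```
synchronous_commit = on     # keep durable confirmation before replying to the client
commit_delay = 100000       # microseconds to wait so concurrent commits share one WAL flush
commit_siblings = 5         # only delay when at least this many other transactions are open
```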

If you use Pg for a job like this, you will need to tune checkpointing to make sure that checkpoints don't stall I/O. Make sure the bgwriter is aggressively writing out dirty buffers. Ensure autovacuum is running frequently - you aren't deleting from the tables, but the indexes still need maintenance, and so do the table statistics.
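
Again as a sketch only (real parameter names, illustrative values to benchmark rather than recommendations), that tuning lives in postgresql.conf along these lines:

```
checkpoint_completion_target = 0.9   # spread checkpoint I/O out instead of bursting it
checkpoint_timeout = 15min           # fewer, smoother checkpoints
bgwriter_delay = 100ms               # wake the background writer more often
bgwriter_lru_maxpages = 500          # let it write more dirty buffers per round
autovacuum = on
autovacuum_naptime = 30s             # look for index/statistics maintenance work more often
```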

If you're expecting lots of data and your queries usually have a temporal element, consider partitioning the table into chunks of (say) one month for the first year, consolidating everything older than 12 months into tables partitioned by year. Pg has only limited built-in partitioning (it's hacked together using inheritance and constraint exclusion), so you have to do it by hand/script using triggers, but it does the job.
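
A sketch of what that hand-rolled partitioning looks like (table and column names here are made up for illustration): each month gets a child table with a CHECK constraint so constraint exclusion can skip it, and a trigger on the parent routes inserts to the right child.

```sql
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text
);

-- One child table per month; note that indexes (including the PK) are not
-- inherited, so create them on each child as needed.
CREATE TABLE events_2012_10 (
    CHECK (created_at >= '2012-10-01' AND created_at < '2012-11-01')
) INHERITS (events);

-- Route inserts on the parent to the right child; extend the IF ladder
-- (or generate it from a script) as new months are added.
CREATE OR REPLACE FUNCTION events_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.created_at >= '2012-10-01' AND NEW.created_at < '2012-11-01' THEN
        INSERT INTO events_2012_10 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for %', NEW.created_at;
    END IF;
    RETURN NULL;  -- the row has already been stored in the child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_partition_insert
    BEFORE INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE events_insert_router();
```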


Craig Ringer
  • Exactly the type of answer I was looking for! I'm a bit put off by the documentation saying `Since all pending commit data will be written at every flush regardless of this setting, it is rare that adding delay by increasing this parameter will actually improve performance` - but I'm assuming that my use case is one of those 'rare' cases? At any rate, I'll need to do a lot of reading up on this and testing/tuning, but this looks very promising. – btown Oct 12 '12 at 07:37
  • @btown You certainly need to test and benchmark before making any decisions. I find that bit in the docs a bit unclear; I suspect it may refer to the fact that any non-delayed commit will cause delayed commits to flush to disk too. I'll be interested in your results. – Craig Ringer Oct 12 '12 at 07:59
  • @btown BTW, the single best thing you can do for this kind of workload is to make sure your storage has very fast syncs. A RAID controller with a battery backed cache in write-back mode is the cheapest option. You won't believe the difference it makes. A good SAN is the pricier option. Whatever you do, do *not* run this kind of workload on something like EC2. If you use a RAID controller with BBU they're not all the same; benchmark or ask on the pgsql-general mailing list. Test your BBU periodically if using one to make sure the battery still works. – Craig Ringer Oct 12 '12 at 08:04