
Simplifying, I have a standalone MySQL instance running in a Docker container hosted on a preemptible node. That means that at least every 24 hours the underlying node is killed, and a clean shutdown is never guaranteed.

Before going this route, I tested the setup by simulating this scenario under different loads, including write operations, and killing the underlying nodes by triggering a kernel panic: no issues even after thousands of restarts.

In the real world, from time to time - let's say once a month across 3,000 MySQL instances - a DB gets corrupted and needs recovery (force recovery, full dump, reload of the dump).

What is the best set of options I could configure to make sure MySQL operates in a way where, even with frequent kills of the server, it will not write inconsistent data? Sacrificing performance is not an issue.
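For reference, the durability-related variables can be inspected at runtime. This is a minimal check, not from my actual setup; the variable names are the standard MySQL ones:

```sql
-- Inspect the variables that control crash durability.
-- For maximum crash safety, innodb_flush_log_at_trx_commit and
-- sync_binlog should both be 1, and innodb_doublewrite should be ON.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_flush_log_at_trx_commit',
   'sync_binlog',
   'innodb_flush_method',
   'innodb_doublewrite');
```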

The disk is a Google Cloud Compute Engine "standard persistent disk". This is the config it currently has (MySQL is running on Docker, so it requires native AIO = OFF):

max_connections = 60
innodb_buffer_pool_size = 16M
tmp_table_size = 4M
key_buffer_size = 8M
query_cache_size = 4M
query_cache_limit = 512K
thread_stack = 128K
performance_schema = 0
show_compatibility_56 = 1
innodb_use_native_aio = OFF

I am looking for my.cnf settings to make MySQL as crash-proof as possible, not architectural solutions.

Edit 25/03/21 I am currently testing the following:

innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
innodb_flush_method = O_DIRECT

The first two are defaults, added just to be sure. It looks like O_DIRECT is the only step closer to the goal. An interesting article on MySQL writes to Google Cloud disks and data-sync techniques: https://dotmanila.com/2017/09/o_dsync-flush-method-for-mysql-on-google-cloud/
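Putting it together, a durability-focused my.cnf fragment might look like this. This is a sketch, not a verified recommendation; the last two settings are already the defaults in recent MySQL versions and are listed only to be explicit:

```ini
# Durability-first settings: every commit is flushed to disk.
innodb_flush_log_at_trx_commit = 1   # fsync the redo log on each commit (default)
sync_binlog = 1                      # fsync the binlog on each commit (default in 5.7+)
innodb_flush_method = O_DIRECT       # bypass the OS page cache for data files
innodb_doublewrite = 1               # protect against torn pages (default)
innodb_checksum_algorithm = crc32    # detect corrupted pages on read (default)
```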

Michele

1 Answer


Run 3 servers for each of your 3K instances. Optionally, each of these 9K MySQL instances could be in VMs, Docker, or whatever.

But... the 3 instances for each dataset must be in 3 geographical locations, and be in a Galera cluster of those 3 nodes. (InnoDB Cluster is another option.)

Any instance can be killed abruptly. It could either be revived or you could even put an 'empty' machine in its place. Galera will recover -- either by incrementally updating (IST) the revived instance or by starting from scratch (SST).
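A minimal wsrep fragment for such a 3-node cluster might look like this. This is a sketch; the node names, addresses, and provider path are placeholders, not from the question:

```ini
# Minimal Galera settings for a 3-node cluster (one node per location).
wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name = my_cluster
wsrep_cluster_address = gcomm://node1.example.com,node2.example.com,node3.example.com
wsrep_node_name = node1
wsrep_sst_method = xtrabackup-v2   # used for full state transfer (SST)
binlog_format = ROW                # required by Galera
innodb_autoinc_lock_mode = 2       # required by Galera
```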

Even the entire datacenter going down should be fully recoverable. That datacenter might have 1 node of each of your 3K clusters, but it must not have two nodes of any cluster.

That handles any "single point of failure" -- a disk, a server, even a datacenter.

One could even argue that this obviates the need for RAID-1/5/6/10 on the disks.

I haven't thought through all the details, but I think 5 Galera nodes spread across 5 datacenters will be fully safe against any 2 points of failure.

Don't forget to worry about network failures. For example, do the datacenters have two cables connecting to the Internet (or a private line)? And software failures.

Rick James
  • The point here is consistency of the binlog and data... since every 24 hours the node is killed. – Michele Mar 19 '21 at 08:44
  • @Michele - Clients need to check for errors after every SQL statement and gracefully handle the error that could occur when the server dies. – Rick James Mar 19 '21 at 14:13
  • that's not relevant either, as you can see in the original answer. – Michele Mar 22 '21 at 12:00