
Simplifying, I have a standalone MySQL instance running in a Docker container hosted on a preemptible node. That means that at least every 24 hours the underlying node is killed, and a clean shutdown is never guaranteed.

Before going this route, I tested the setup by simulating this scenario under different loads, including write operations, and killing the underlying nodes by triggering a kernel panic: no issues even after thousands of restarts.

In the real world, from time to time - let's say once a month across 3,000 MySQL instances - a DB gets corrupted and needs recovery (force recovery, full dump, reload of the dump).

What is the best set of options I could configure to make sure MySQL operates in a way where, even with frequent kills of the server, it will not write inconsistent data? Sacrificing performance is not an issue.
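For reference, the durability-related variables can be inspected at runtime. This is a minimal check, not from my actual setup; the variable names are the standard MySQL ones:

```sql
-- Inspect the variables that control crash durability.
-- For maximum crash safety, innodb_flush_log_at_trx_commit and
-- sync_binlog should both be 1, and innodb_doublewrite should be ON.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_flush_log_at_trx_commit',
   'sync_binlog',
   'innodb_flush_method',
   'innodb_doublewrite');
```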

The disk is a Google Cloud Compute Engine "standard persistent disk". This is the config it currently has (MySQL is running on Docker, so it requires native AIO = OFF):

max_connections = 60
innodb_buffer_pool_size = 16M
tmp_table_size = 4M
key_buffer_size = 8M
query_cache_size = 4M
query_cache_limit = 512K
thread_stack = 128K
performance_schema = 0
show_compatibility_56 = 1
innodb_use_native_aio = OFF

I am looking for my.cnf settings to make MySQL as crash-proof as possible, not architectural solutions.

Edit 25/03/21 I am currently testing the following:

innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
innodb_flush_method = O_DIRECT

The first two are defaults, added just to be sure. It looks like O_DIRECT is the only step closer to the goal. An interesting article on MySQL writes to Google Cloud disks and data-sync techniques: https://dotmanila.com/2017/09/o_dsync-flush-method-for-mysql-on-google-cloud/
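Putting it together, a durability-focused my.cnf fragment might look like this. This is a sketch, not a verified recommendation; the last two settings are already the defaults in recent MySQL versions and are listed only to be explicit:

```ini
# Durability-first settings: every commit is flushed to disk.
innodb_flush_log_at_trx_commit = 1   # fsync the redo log on each commit (default)
sync_binlog = 1                      # fsync the binlog on each commit (default in 5.7+)
innodb_flush_method = O_DIRECT       # bypass the OS page cache for data files
innodb_doublewrite = 1               # protect against torn pages (default)
innodb_checksum_algorithm = crc32    # detect corrupted pages on read (default)
```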

Michele

1 Answer


Run 3 servers for each of your 3K instances. Optionally, each of these 9K MySQL instances could be in VMs, Docker, or whatever.

But... the 3 instances for each dataset must be in 3 geographical locations, and be in a Galera cluster of those 3 nodes. (InnoDB Cluster is another option.)

Any instance can be killed abruptly. It could either be revived or you could even put an 'empty' machine in its place. Galera will recover -- either by incrementally updating (IST) the revived instance or by starting from scratch (SST).
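A minimal wsrep fragment for such a 3-node cluster might look like this. This is a sketch; the node names, addresses, and provider path are placeholders, not from the question:

```ini
# Minimal Galera settings for a 3-node cluster (one node per location).
wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name = my_cluster
wsrep_cluster_address = gcomm://node1.example.com,node2.example.com,node3.example.com
wsrep_node_name = node1
wsrep_sst_method = xtrabackup-v2   # used for full state transfer (SST)
binlog_format = ROW                # required by Galera
innodb_autoinc_lock_mode = 2       # required by Galera
```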

Even the entire datacenter going down should be fully recoverable. That datacenter might have 1 node of each of your 3K clusters, but it must not have two nodes of any cluster.

That handles any "single point of failure" -- a disk, a server, even a datacenter.

One could even argue that this obviates the need for RAID-1/5/6/10 on the disks.

I haven't thought through all the details, but I think 5 Galera nodes spread across 5 datacenters will be fully safe against any 2 points of failure.

Don't forget to worry about network failures. For example, do the datacenters have two cables connecting to the Internet (or a private line)? And software failures.

Rick James
  • The point here is consistency of the binlog and data... since every 24 hours the node is killed. – Michele Mar 19 '21 at 08:44
  • @Michele - Clients need to check for errors after every SQL statement and gracefully handle the error that could occur when the server dies. – Rick James Mar 19 '21 at 14:13
  • that's not relevant either, as you can see in the original answer. – Michele Mar 22 '21 at 12:00