LXD + Galera Cluster + Max Scale : Shutting down server != Stopping containers

Question

I have a Galera cluster with 4 nodes: 2 on one server (server-master) and 2 on another server (server-slave).

The cluster is controlled through MaxScale.

The setup seems correct: replication works fine, `SHOW STATUS LIKE 'wsrep_cluster_size'` reports the correct size on all nodes, shutting down the master successfully transfers the master role to the next node, etc.

maxscale server status reports: (summarized for simplicity)

Master, Synced, Running | Slave, Synced, Running | Slave, Synced, Running

If I stop both containers on "server-master" at the same time, the master DB is successfully assigned to the first container on "server-slave".

maxscale server status reports:

Down | Down | Master, Synced, Running | Slave, Synced, Running

The problem is: if I shut down server-master itself,

maxscale server status reports:

Down, Down, Running, Running

And trying to connect to the cluster results in a failed connection. After some time, all nodes are reported as Down.

I don't understand why shutting down the server doesn't work as expected.

UPDATE

I discovered that if I turn off the second node on "server-master" and then shut down the server, "master" is successfully assigned to "server-slave". However, after a few minutes all nodes go down. :/

Servers: Ubuntu Servers 16.04 x64
MaxScale version: 2.0.5
LXD version: 2.13
Galera version (3): 25.3.20-xenial
Guide followed: https://www.digitalocean.com/community/tutorials/how-to-configure-a-galera-cluster-with-mariadb-10-1-on-ubuntu-16-04-servers

A Galera cluster of 4 can't survive the simultaneous loss of 2 nodes, since the 2 remaining nodes will assume they're in a split brain condition. Shutting down b then a should allow c and d to survive if enough time passes between the loss of b and a. But it isn't yet clear whether the remaining nodes are really down, or pretending to be down (curled up into a safe little protective ball, alive but refusing to service queries due to a perceived partitioning event) or MaxScale is only detecting them as such. — Michael - sqlbot, May 14 '17 at 10:30
@Michael-sqlbot: Thanks, I think this article explains it: http://galeracluster.com/documentation-webpages/twonode.html Why not post your comment as an answer? — lepe, May 15 '17 at 01:08
If I run `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'` on the slave database, it changes from `Running` to `Master, Synced, Running`. It works even if 2 nodes go down (the two on server-master). The only downside is that I have to run it through a script, which is not elegant. — lepe, May 15 '17 at 01:34
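
For reference, the decision such a script has to make can be sketched in a few lines. This is only an illustration, not the script used above: the function name `bootstrap_statement` is hypothetical, and it assumes the caller has already read the node's `wsrep_cluster_status` value (for example with `SHOW STATUS LIKE 'wsrep_cluster_status'`) and will execute the returned statement through whatever MySQL client it uses.

```python
from typing import Optional

# Statement quoted in the comment above; it forces the node's current
# component to become the Primary component.
BOOTSTRAP_SQL = "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"


def bootstrap_statement(cluster_status: str) -> Optional[str]:
    """Return the SQL to run, or None when no bootstrap is needed.

    `cluster_status` is the value of the wsrep_cluster_status variable
    as read from the node (assumed to be fetched by the caller).
    """
    if cluster_status.strip().lower() == "primary":
        # The node is already part of a Primary component: do nothing.
        return None
    # Any other state (e.g. "non-Primary") means the node refuses queries.
    return BOOTSTRAP_SQL


if __name__ == "__main__":
    print(bootstrap_statement("non-Primary"))
    print(bootstrap_statement("Primary"))
```

The statement should only be issued on one node of the surviving partition, and only when the other partition is known to be down; bootstrapping both sides of a real network split would let the two halves diverge.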

score 0 · Accepted Answer · answered Jun 06 '17 at 09:03

This is related to Galera Cluster behaviour.

When you shut down MySQL on a node (for example by stopping its container), the node sends a leave request before it stops and leaves the cluster gracefully. The cluster sees that 2 nodes have left and can keep working with the remaining 2.
When you shut down the host, MySQL is killed and of course cannot send a leave request. The cluster detects that 2 nodes are dead and that only 2 nodes are left, which is <= 50% of the cluster size. Without a majority the remaining nodes lose quorum, so the cluster is put into a failed (non-Primary) state and does not accept connections.

So you cannot connect from a client through MaxScale to the cluster.

ref : http://galeracluster.com/documentation-webpages/weightedquorum.html
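
The quorum arithmetic behind the two cases can be sketched as follows. This is a simplified model, assuming every node has the default weight of 1; the function name `keeps_quorum` and its parameters are made up for the illustration. It applies the rule from the weighted quorum page: the surviving nodes stay Primary only if they outnumber half of the previous membership, after subtracting the nodes that left gracefully.

```python
def keeps_quorum(previous_size: int, graceful_leaves: int, survivors: int) -> bool:
    """Simplified Galera quorum check with equal node weights (assumption).

    previous_size   -- nodes in the last Primary component
    graceful_leaves -- nodes that sent a leave request before stopping
    survivors       -- nodes that can still see each other
    """
    # Nodes that left gracefully no longer count toward the total;
    # nodes that simply vanished still do.
    return survivors > (previous_size - graceful_leaves) / 2


if __name__ == "__main__":
    # Stopping both containers: 2 graceful leaves, 2 survivors.
    print(keeps_quorum(previous_size=4, graceful_leaves=2, survivors=2))
    # Shutting down the host: 2 nodes vanish without a leave request.
    print(keeps_quorum(previous_size=4, graceful_leaves=0, survivors=2))
    # Stop one node first (cluster shrinks to 3), then the host dies.
    print(keeps_quorum(previous_size=3, graceful_leaves=0, survivors=2))
```

Under these assumptions the first and third cases keep quorum and the second does not, which matches what the question reports for stopping containers versus shutting down the host. The model does not explain why all nodes go down a few minutes later in the UPDATE scenario.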
