
Here is the scenario: I had to make a config change to MariaDB in a 3-node cluster. I edited the config file and shut down the node with:

# service mysqld stop

I made the same change on the other 2 nodes and shut them down the same way. When I started the most advanced node with

# galera_new_cluster 

it started OK.

It started OK on the next node too. The last node is the first one I made the change to.

This node does not start: it hits the service startup timeout and just dies. I can't find any errors on the failed node or on the primary node. I did adjust the timeout from the 90-second default to 2 hours, and it again just timed out.
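
For reference, I raised the startup timeout with a systemd drop-in roughly like this (the file name is my own choice; the unit name matches the `mariadb.service` shown in the error below):

```ini
# /etc/systemd/system/mariadb.service.d/timeout.conf
[Service]
TimeoutStartSec=7200
```

followed by `systemctl daemon-reload` before retrying the start.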

Just looking for some clues as to what might be going on.

Galera config on the node that now fails:

[galera]
# Mandatory settings
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=192.168.10.238
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://192.168.10.200,192.168.10.201"
## Galera Cluster Configuration
wsrep_cluster_name="Cluster1"
## Galera Synchronization Configuration
wsrep_sst_method=rsync
## Galera Node Configuration
wsrep_node_address="192.168.10.238"
wsrep_node_name="db3"

Startup timeout message:

# service mysql start
Starting mysql (via systemctl):  Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
                                                           [FAILED]

On the primary node, I see the last node join the cluster successfully:

2018-03-13  9:08:36 140261636175616 [Note] WSREP: (050b87ee, 'tcp://0.0.0.0:4567') connection established to ce802915 tcp://192.168.10.238:4567
2018-03-13  9:08:36 140261636175616 [Note] WSREP: (050b87ee, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2018-03-13  9:08:36 140261636175616 [Note] WSREP: declaring 8ee7874f at tcp://192.168.10.201:4567 stable
2018-03-13  9:08:36 140261636175616 [Note] WSREP: declaring ce802915 at tcp://192.168.10.238:4567 stable
2018-03-13  9:08:36 140261636175616 [Note] WSREP: Node 050b87ee state prim
2018-03-13  9:08:36 140261636175616 [Note] WSREP: view(view_id(PRIM,050b87ee,87) memb {
        050b87ee,0
        8ee7874f,0
        ce802915,0
} joined {
} left {
} partitioned {
})
2018-03-13  9:08:36 140261636175616 [Note] WSREP: save pc into disk
2018-03-13  9:08:36 140261627782912 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-03-13  9:08:36 140261627782912 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a3fa8cd5-26bf-11e8-8f00-a686a9b10fbd
2018-03-13  9:08:36 140261627782912 [Note] WSREP: STATE EXCHANGE: sent state msg: a3fa8cd5-26bf-11e8-8f00-a686a9b10fbd
2018-03-13  9:08:36 140261627782912 [Note] WSREP: STATE EXCHANGE: got state msg: a3fa8cd5-26bf-11e8-8f00-a686a9b10fbd from 0 (b1)
2018-03-13  9:08:36 140261627782912 [Note] WSREP: STATE EXCHANGE: got state msg: a3fa8cd5-26bf-11e8-8f00-a686a9b10fbd from 1 (db2)
2018-03-13  9:08:36 140261627782912 [Note] WSREP: STATE EXCHANGE: got state msg: a3fa8cd5-26bf-11e8-8f00-a686a9b10fbd from 2 (db3)
2018-03-13  9:08:36 140261627782912 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 20,
        members    = 2/3 (joined/total),
        act_id     = 3166274,
        last_appl. = 3166245,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = 4687a061-0310-11e8-a49f-534404044853
2018-03-13  9:08:36 140261627782912 [Note] WSREP: Flow-control interval: [28, 28]
2018-03-13  9:08:36 140261627782912 [Note] WSREP: Trying to continue unpaused monitor
2018-03-13  9:08:36 140261931068160 [Note] WSREP: New cluster view: global state: 4687a061-0310-11e8-a49f-534404044853:3166274, view# 21: Primary, number of nodes: 3, my index: 0, protocol version 3
2018-03-13  9:08:36 140261931068160 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-03-13  9:08:36 140261931068160 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-03-13  9:08:36 140261931068160 [Note] WSREP: Assign initial position for certification: 3166274, protocol version: 3
2018-03-13  9:08:36 140261686085376 [Note] WSREP: Service thread queue flushed.
2018-03-13  9:08:39 140261636175616 [Note] WSREP: (050b87ee, 'tcp://0.0.0.0:4567') turning message relay requesting off

I see no errors when MariaDB tries to start on the failed host:

Mar 13 09:11:43 pn09 systemd: Starting MariaDB 10.1.30 database server...
Mar 13 09:11:49 pn09 sh: WSREP: Recovered position 4687a061-0310-11e8-a49f-534404044853:2564462
Mar 13 09:11:49 pn09 mysqld: 2018-03-13  9:11:49 122545243265280 [Note] /usr/sbin/mysqld (mysqld 10.1.30-MariaDB) starting as process 7558 ...
Mar 13 09:11:50 pn09 rsyncd[7667]: rsyncd version 3.0.9 starting, listening on port 4444

I'm at a bit of a loss on this one, and any direction is appreciated. It's unclear to me why simply stopping and starting the service on a single node would cause an issue like this. The other 2 nodes I can bring up and down cleanly without issue.

One point of note: while digging through this I noticed that all 3 nodes have seqno set to -1 in the grastate.dat file. I'm not sure why that would happen, or whether it's a critical issue.
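
For comparison, my understanding is that a cleanly shut-down node records its position in grastate.dat, something along these lines (uuid taken from my logs above; exact field layout may vary by Galera version):

```text
# GALERA saved state
version: 2.1
uuid:    4687a061-0310-11e8-a49f-534404044853
seqno:   3166274
```

A seqno of -1 would mean the node wasn't considered cleanly shut down, so it recovers its position from InnoDB at startup instead, which would match the "Recovered position" line in the log above.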

Another point of interest: some background processes remain running after the service startup fails:

# ps aux |grep mysql
mysql     7558  0.2  0.3 349912 57920 ?        Ssl  09:11   0:00 /usr/sbin/mysqld --wsrep_start_position=4687a061-0310-11e8-a49f-534404044853:2564462
mysql     7567  1.2  0.0 113388  1796 ?        S    09:11   0:03 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role joiner --address 192.168.10.238 --datadir /var/lib/mysql/ --parent 7558
mysql     7667  0.0  0.0 114652  1068 ?        S    09:11   0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
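
The rsync daemon on port 4444 looks like the node waiting as an SST joiner, so the leftovers suggest the state transfer starts but never completes. Here is how I check for them after a failed start (the process patterns are just taken from the `ps` output above):

```shell
# List any leftover SST helper processes after a failed start.
# The [w]/[r] bracket trick stops grep from matching its own command line.
ps aux | grep -E '[w]srep_sst_rsync|[r]sync --daemon' || echo "no leftover SST processes"
```

I kill these off (e.g. with `pkill -f wsrep_sst_rsync`) before retrying `service mysql start`.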
Marc
  • A few questions: Is perhaps SELinux running (in enforcing mode)? Do all the nodes have `wsrep_sst_method=rsync`? Are you able to log in with the `mysql` client on the node? Which 10.1 version is this specifically? Is it the same version on all 3 nodes? Are you able to try using a different wsrep_sst_method? – dbdemon Mar 25 '18 at 17:09
  • SELinux is off on all 3 nodes. For using mysql as client - are you asking if I can log in on the node that will not reconnect? No, mysql. Same version across the board. Have not tried a different wsrep_sst method. What is really throwing me here is all I did was restart the service and it did not come back up. – Marc Mar 26 '18 at 17:43
  • It's possible that you've encountered a bug, so have a look at [MariaDB's Jira](https://jira.mariadb.org/projects/MDEV/issues) if you haven't already done so. Can you tell if rsync is being started on the donor node at all? What does the log of the donor node look like? (I don't think that's the one you posted above?) Also, what was the config change you did? – dbdemon Mar 26 '18 at 21:26
