
We have a Percona XtraDB Cluster set up with 3 nodes, using xtrabackup-v2 as the SST method.

Everything was working and in synchronisation when we shut down nodes 2 and 3, leaving only node 1. The nodes stayed down for a week, during which time the database grew by 100 GB.

When we attempted to restart nodes 2 and 3, the startup failed during the initial SST, after less than a minute. I have tried completely removing /var/lib/mysql and restarting, but it has the same effect.

The error logs appear to show an issue with the initial SST, possibly due to the volume of data that has to be transferred on initial startup. We have sufficient disk space, and the file permissions are correct. The xtrabackup package is installed and available (and worked previously anyway).

The logs show a 'No such file or directory' error.

Joiner logs show:

    Dec 15 01:21:51 xm1adb05 mysqld:     Group state: 67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440
    Dec 15 01:21:51 xm1adb05 mysqld:     Local state: 00000000-0000-0000-0000-000000000000:-1
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: New cluster view: global state: 67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440, view# 54: Primary, number of nodes: 2, my index: 1, protocol version 3
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Warning] WSREP: Gap in state sequence. Need state transfer.
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.23.40.115' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '13029' '' '
    Dec 15 01:21:51 xm1adb05 mysqld: WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20161215 01:21:51.575)
    Dec 15 01:21:51 xm1adb05 -wsrep-sst-joiner: Streaming with xbstream
    Dec 15 01:21:51 xm1adb05 -wsrep-sst-joiner: Using socat as streamer
    ...
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb): 1 (Operation not permitted)
    Dec 15 01:21:51 xm1adb05 mysqld:     at galera/src/replicator_str.cpp:prepare_for_IST():507. IST will be unavailable.
    ...
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Member 1.0 (xm1adb05) requested state transfer from '*any*'. Selected 0.0 (xm1adb04)(SYNCED) as donor.
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 5766440)
    Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Requesting state transfer: success, donor: 0
    Dec 15 01:21:51 xm1adb05 mysql-systemd: State transfer in progress, setting sleep higher
    ...
    Dec 15 01:22:02 xm1adb05 -wsrep-sst-joiner: xtrabackup_checkpoints missing, failed innobackupex/SST on donor
    Dec 15 01:22:02 xm1adb05 -wsrep-sst-joiner: Cleanup after exit with status:2
    Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.23.40.115' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '13029' '' : 2 (No such file or directory)
    Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
    Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
    Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: SST failed: 2 (No such file or directory)
    Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] Aborting

Donor logs show:

    Dec 15 01:22:02 xm1adb04 mysqld: 2016-12-15 01:22:02 6531 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440'
    Dec 15 01:22:02 xm1adb04 mysqld: 2016-12-15 01:22:02 6531 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440': 22 (Invalid argument)
    Dec 15 01:22:03 xm1adb04 mysqld: 2016-12-15 01:22:03 6531 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440'

Similar actions successfully started the secondary nodes on another (much smaller) database, so it would seem that the size may be the issue.

Can anyone give some help on how we can initialise and restart the additional nodes?

Steve Shipway

5 Answers


Just switch the SST method to rsync:

wsrep_sst_method                = rsync

Once the node is in sync, stop it, switch wsrep_sst_method back to xtrabackup-v2, and start it again; the full sequence is sketched below.
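As a rough sketch, the workaround on the joiner might look like this (assuming a systemd service named mysql; the service name is an assumption, adjust for your platform):

    # 1. In /etc/my.cnf set: wsrep_sst_method = rsync
    systemctl start mysql       # node joins with a full rsync SST
    # wait until the node reports 'Synced':
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
    systemctl stop mysql
    # 2. In /etc/my.cnf set back: wsrep_sst_method = xtrabackup-v2
    systemctl start mysql       # any catch-up is now a (small) IST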

Ace.Di
• This works as a workaround (as does manually copying the entire data directory over), but it requires the entire cluster to be unavailable during the copy. That is not optimal, as a production cluster needs to be available all the time (which is why we're using Galera). We really need to know the root cause of the SST failing even though the IST works. – Steve Shipway Jan 31 '19 at 05:13
• XtraBackup failed in your case. You can get a lead on the XtraBackup failure by looking at the log files XtraBackup generates; see the sketch below.
• By the way, check the XtraBackup log on the donor node: per the snippet above, XtraBackup there reports 'Invalid argument'.
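Where to look, sketched below; the innobackup.backup.log path is the usual default location for wsrep_sst_xtrabackup-v2 but may differ on your install, and per the joiner log above the SST stderr also goes to syslog:

    # On the donor: innobackupex output from the SST attempt
    less /var/lib/mysql/innobackup.backup.log
    # On both nodes: the SST script logs to syslog
    grep -i 'wsrep.sst\|innobackupex' /var/log/messages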
Krunal

The solution to restart the nodes was to first restart the one remaining cluster member (node 1), then completely wipe /var/lib/mysql on the joiners (nodes 2 and 3) before restarting them and letting them rejoin. This forces a full SST, and everything worked. A rough sketch of the sequence follows.
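A minimal sketch of that recovery, assuming PXC under systemd (the service names mysql and mysql@bootstrap are assumptions; use the equivalents for your init system):

    # On node 1, the sole surviving member: restart it to clear its view of the joiners.
    # If it is the only node up, it must come back as a cluster bootstrap.
    systemctl stop mysql
    systemctl start mysql@bootstrap.service
    # On each joiner (nodes 2 and 3), one at a time:
    systemctl stop mysql
    rm -rf /var/lib/mysql/*     # an empty datadir forces a full SST on start
    systemctl start mysql       # node rejoins via SST from node 1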

The problem seems to be that nodes 2 and 3 had been marked as partitioned on node 1, so it was not allowing the SST to complete (I think maybe the final IST was denied, and so the SST rolled back). Restarting node 1 seems to reset the partitioning, after which the SST could complete.

We also had a rather small gcache.size, which didn't help, as there were a lot of writes going on in the database; see the example below.
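For reference, the gcache is sized through the Galera provider options in my.cnf; a larger ring buffer keeps more write-sets available, so a returning node can catch up via IST instead of needing a full SST. A sketch (the 2G figure is an arbitrary example; size it to your write rate and expected downtime):

    [mysqld]
    wsrep_provider_options = "gcache.size=2G"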

Later events showed that the SST had failed due to issues with xtrabackup on the donor node: the xtrabackup process didn't like our my.cnf, in which we had a duplicated line. Fixing this and restarting the donor (to end the partitioning) let things work. One way to check for such duplicates is sketched below.
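A quick check using the standard my_print_defaults utility (a rough sketch; note it only flags options that are repeated with identical values):

    # print the options mysqld and xtrabackup will actually read, and flag exact duplicates
    my_print_defaults --defaults-file=/etc/my.cnf mysqld xtrabackup | sort | uniq -d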

Steve Shipway
• We're seeing what appears to be the same problem, and unfortunately this is not acceptable for us: the join has to happen without manual intervention. – Jan Hudec Nov 07 '18 at 13:40
• Another possible cause: xtrabackup is pickier about my.cnf than mysqld is, which can cause the SST to fail. Look for duplicate lines or non-fatal syntax errors; these can make the SST fail, which in turn gets the node partitioned. – Steve Shipway Nov 08 '18 at 19:48

For anyone looking for an answer with similar errors, especially on the donor: in my case I had to open firewall ports 4444, 4567 and 4568 between all the nodes. An example follows.
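For example, with firewalld (an assumption; use equivalent iptables or security-group rules otherwise):

    # on every node: Galera group communication, IST, and SST ports
    firewall-cmd --permanent --add-port=4567/tcp   # group replication traffic
    firewall-cmd --permanent --add-port=4568/tcp   # incremental state transfer (IST)
    firewall-cmd --permanent --add-port=4444/tcp   # state snapshot transfer (SST)
    firewall-cmd --reload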

user109764

First take a backup of the most advanced node (node 1) and restore it on the second and third nodes, then try to restart them; that should solve the problem.

The reason: while the second node is starting, the donor does not receive a last LSN number from the joiner, because the second node's datadir is empty and there is no data. A rough sketch of the backup-and-restore step follows.
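A minimal sketch of that seed step with xtrabackup (2.4-style syntax; paths, credentials handling and the transfer method are assumptions, and the restored joiner may still request an IST/SST for the delta when it starts):

    # on node 1: take and prepare a consistent backup
    xtrabackup --backup --target-dir=/backup/seed
    xtrabackup --prepare --target-dir=/backup/seed
    # transfer /backup/seed to the joiner (rsync, scp, ...), then on the joiner:
    systemctl stop mysql
    rm -rf /var/lib/mysql/*
    xtrabackup --copy-back --target-dir=/backup/seed --datadir=/var/lib/mysql
    chown -R mysql:mysql /var/lib/mysql
    systemctl start mysql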