Multi-state MySQL master/slave pacemaker resource fails to launch on cluster nodes

Question

Setup

I'm setting up an HA cluster for a web application using two physical servers in a Corosync/Pacemaker managed cluster.

After finding out I was heading the wrong way, I decided to use heartbeat's bundled MySQL resource agent to manage my MySQL instances across the cluster.

Currently, there is a working master/slave configuration from node1 (current master) to node2 (current slave). Now I would like Pacemaker to manage my MySQL instances so it can promote/demote master or slave.

According to this (old) wiki page, I should be able to achieve the setup by doing so:

primitive p_mysql ocf:heartbeat:mysql \
  params binary="/usr/sbin/mysqld" \
  op start timeout="120" \
  op stop timeout="120" \
  op promote timeout="120" \
  op demote timeout="120" \
  op monitor role="Master" timeout="30" interval="10" \
  op monitor role="Slave" timeout="30" interval="20"

ms ms_mysql p_mysql \
  meta clone-max=3

As you can see, I did slightly change the interval of the second op monitor, since I know Pacemaker identifies actions by resource name (here, p_mysql), action name, and interval. Using a different interval was the only way to differentiate the monitor action on a slave node from the monitor action on a master node.
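To make the identity rule concrete: Pacemaker derives an operation key from the resource name, the action name, and the interval in milliseconds. The fragment below is a trimmed sketch of the primitive above, keeping only the two monitor operations. The keys in the comments are what one would expect to see in the logs, not output captured from this cluster, and clone instances may carry an extra instance suffix.

```crmsh
# Expected operation keys (resource_action_interval-in-ms):
#   role=Master monitor -> p_mysql_monitor_10000
#   role=Slave  monitor -> p_mysql_monitor_20000
# With identical intervals, both monitors would map to the same key.
primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" \
    op monitor role="Master" timeout="30" interval="10" \
    op monitor role="Slave" timeout="30" interval="20"
```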

Problem

After committing the changes to the CIB and opening an interactive crm_mon, I could see that Pacemaker failed to start the resource on every node. See attached screenshots:

Sorry, I cannot post more than 2 links because I do not have enough reputation yet, so the screenshots are in the comments.

And it loops over and over, trying to set the current master to a slave, the current slave to a slave, then to a master... It is clearly looping and fails to start the MySQL instances properly.

For reference, my crm configure show:

node 1: primary
node 2: secondary
primitive Failover ocf:onlinenet:failover \
    params api_token=108efe5ee771368557869c7a837361a7c786f210 failover_ip=212.129.48.135
primitive WebServer apache \
    params configfile="/etc/apache2/apache2.conf" statusurl="http://127.0.0.1/server-status" \
    op monitor interval=40s \
    op start timeout=40s interval=0 \
    op stop timeout=60s interval=0
primitive p_mysql mysql \
    params binary="/usr/sbin/mysqld" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20
ms ms_mysql p_mysql \
    meta clone-max=3
clone WebServer-clone WebServer
colocation Failover-WebServer inf: Failover WebServer-clone
property cib-bootstrap-options: \
    dc-version=1.1.12-561c4cf \
    cluster-infrastructure=corosync \
    cluster-name=ascluster \
    stonith-enabled=false \
    no-quorum-policy=ignore

Screenshots: [crm_mon step 1](http://i.stack.imgur.com/uFvvL.png) [crm_mon step 2](http://i.stack.imgur.com/9XeYD.png) [crm_mon step 3](http://i.stack.imgur.com/sAxFW.png) — Habovh, Feb 02 '16 at 20:04
@gf_ Yep, that's what I said, I've got a working master/slave setup without Pacemaker between my two nodes... — Habovh, Feb 02 '16 at 20:47
Could you include `notify=true` in the metadata of `ms_mysql`? Be sure to reset the fail counter afterwards and restart the resource. — gxx, Feb 02 '16 at 21:29
Also, did you read [this](https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql#L89)? Does `ping` between your nodes work accordingly? Does mysql listen / bind to these answering interfaces? — gxx, Feb 02 '16 at 22:22
At first I did not include the `notify=true`, but adding it to the resource did not change anything, still this kind of *loop*. I also tried to set up the basic MySQL replication (without Pacemaker) using information from `uname -n`, and it is working just fine... — Habovh, Feb 03 '16 at 10:59
Here's an extract from `/var/log/corosync/corosync.log` on the `primary` server (hostname and name of the node in the cluster): http://pastebin.com/taGFg9AC — Habovh, Feb 03 '16 at 16:46
I did not set it since it is the default one, do you think it is required? Because there are a lot of different parameters all set to `required="0"`... — Habovh, Feb 03 '16 at 16:57
The param is not required. However, if the default value differs from your setup, you have to set it accordingly. As it seems, the default is set to `/var/lib/mysqld/mysqld.sock`. Try to set the param to `/var/run/mysqld/mysqld.sock` (this is the Debian default) or to the correct path of your socket. — gxx, Feb 03 '16 at 17:04
Yes, I am on Debian 8. Now I added every parameter except: `test_table`, `test_user`, `test_passwd`, `enable_creation`, `additional_parameters`, `max_slave_lag`, `evict_outdated_slaves` and `reader_attribute`. Every other parameter is set properly and yet Pacemaker fails to start everything properly... By the way, how did you get the default value for the `socket` parameter? — Habovh, Feb 04 '16 at 09:28
Current `crm configure show` output: http://pastebin.com/TWwFMtZw — Habovh, Feb 04 '16 at 09:45
Could you stop the resource, reset the fail count, start it again and provide the log? — gxx, Feb 04 '16 at 13:22
"By the way, how did you get the default value for the socket parameter?": I've just had a look into the log you've provided. As you didn't set the value, I assumed that `/var/lib/mysqld/mysqld.sock` is the default. — gxx, Feb 04 '16 at 13:23
Ok thanks, that's a good point. I'll try to cleanup and see if I see something new in the logs! But I only have a little hope... — Habovh, Feb 04 '16 at 13:25
I guess `crm resource cleanup` is enough to reset the fail count? — Habovh, Feb 04 '16 at 13:26
Ok, so that did not help. I somehow managed to get Pacemaker to start mysql on `node1` as master, by changing this in `crm configure`: `location cli-prefer-MySQL MySQL-ms role=Master inf: node1`. But Pacemaker will not start mysql on `node2`, here's the log: http://pastebin.com/EX0ZKjvf — Habovh, Feb 04 '16 at 13:59
From what I could see, at some point it tries to login on mysql server with username `replication_user` (which is right) using password `YES` and yet it fails to connect. I assume it uses Pacemaker's params to login, and when I try to login manually with the same credentials, it works... — Habovh, Feb 04 '16 at 14:01
Something caught my attention in the previous log: the DB privileges for the `replication_user` were not set correctly [according to the spec](https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql#L214-L219). New log here, fewer errors, still not working: http://pastebin.com/XpLhZ3KC — Habovh, Feb 04 '16 at 14:28
This "Slave failed to initialize relay log info structure from the repository" seems to be an issue with your `mysql` setup. We're on a good way, I think. — gxx, Feb 04 '16 at 16:36
Could you start `mysql` manually and show the output of `SHOW SLAVE STATUS\G` from the slave node? — gxx, Feb 04 '16 at 16:38
Actually I noticed that too. I had made sure replication worked before trying again, but something went wrong with my setup. I fixed it and now it is working fine without Pacemaker. I'm trying to play around with `interval` values because the logs say that the monitor op failed on node2... Will upload the log — Habovh, Feb 04 '16 at 16:49
Here's log file where node1 is master, node2 cycles between stopped, failed and —sometimes quickly— slave. http://pastebin.com/fXRDi9Hg — Habovh, Feb 04 '16 at 16:58
Master node (node1) mysql log file: http://pastebin.com/KnJmTrgM. Slave node (node2) mysql log file: http://pastebin.com/NH1HDp6d. — Habovh, Feb 04 '16 at 17:45
There is a [bug report](https://bugs.mysql.com/bug.php?id=79771). Which `mysql` version is this? Could you try to create a new (empty) database, and only replicate this db, to see if this gets you going? — gxx, Feb 04 '16 at 17:50
Doing this on mysql 5.7.10, but like I said, the replication works fine without Pacemaker, so I do not understand why it would not work properly at this point. — Habovh, Feb 04 '16 at 17:56
Which version of the resource agent are you using? Could you (temporarily) replace yours with the one from GitHub? — gxx, Feb 04 '16 at 18:09
MySQL 5.7.6 introduced the concept of replication channels, so the output of replication-related mysql commands may have changed. So using the newest RA (as gf_ suggests) might be a good idea. — remote mind, Feb 04 '16 at 22:03
I made a `diff` before using the master script, here's the result: [http://pastebin.com/2iH45b1X](http://pastebin.com/2iH45b1X). As you can see, there are not many differences between the version I'm using and the one on the `master` branch. I will still try, but I doubt very much this will resolve the issue. — Habovh, Feb 05 '16 at 09:58
Oh my god guys, I found out what was going wrong. By updating the RA to the one from GitHub, I stumbled across the [`mysql-common.sh`](https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql-common.sh) file, which essentially holds low-level commands and **RA defaults**. I took the time to read the defaults, and had an idea when seeing `test_user` and related stuff. Going back to the RA, I noticed it was used to perform a simple `select` on the database. Default was set to `root` w/o password, which would not work. I used my `replication_user` granting the user read access. — Habovh, Feb 05 '16 at 10:50
And thank god it worked, finally! Thank you for your support @gf_ and also thanks for taking the time to read a lot of not funny debugging steps *remote mind* ! The lesson I learned these last few days about HA: debugging is **complicated**, do not hesitate to override default values for resource agent settings, and have a lot of patience! — Habovh, Feb 05 '16 at 10:53
Glad to hear! Great! Good luck with `pacemaker`, `corosync` and all the other involved tools - these are really really really great! Cheers! — gxx, Feb 05 '16 at 11:14

score 3 · Accepted Answer · edited Apr 13 '17 at 12:14

Solution

Thanks to the folks that investigated with me, I was able to find the solution to my problem and I do now have a working setup. If you feel brave enough, you can read the comments on the original question, but here is a summary of the steps that helped me solve my issue.

Read the source

The first thing to do when setting up HA resources will sound typical, but RTFM. No, seriously, learn how the software you're planning to use works. In this particular case, my first mistake was not reading and understanding carefully enough how the resource agent (RA) works. Since I was using the mysql RA provided by Heartbeat, the RA source script was available on ClusterLabs' resource-agents GitHub repo.

Do not forget to read the source of included files!
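Before opening the script itself, it can help to list the parameters the agent declares. A minimal sketch, assuming crmsh is installed and the agent is `ocf:heartbeat:mysql` as in the question:

```crmsh
# Print the mysql agent's metadata: parameter names, short descriptions, declared defaults
crm ra info ocf:heartbeat:mysql
```

Treat this output as a starting point only. In this case the values actually applied were defined in the included `mysql-common.sh` file, so that file is still worth reading.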

Make sure your software is up-to-date

This was not clearly identified as an issue in my particular case, but as @gf_ and @remote mind suggested, it is always a good thing to have a version of your RA that works with your software version.

Fill in the damn params

Number one rule in HA: do not rely on default values.

That's not true, sometimes you can, but honestly, if I had provided every optional parameter that I could to the RA, I would have fixed my issue way quicker.

This is actually where the Read the source part is important, since it will allow you to truly understand why the parameters are needed. However, since they are often only briefly described, you may need to go further than the meta-data and find where the parameters are used. In my case, the thing did not work for several reasons:

I did not provide the socket path, and the default one for the script did not match the default one for my system (Debian 8).
I did not provide test_user and test_passwd: these were present in the meta-data, but I thought that I did not need them. Once I decided to look at what they were used for, I found that the RA uses them to run a simple select count(*) on the database. Since the defaults are user root without a password, it did not work in my case (on my databases, root needs a password to connect). This prevented the RA from checking whether the current node was a slave or not.
Some other params were also missing, and I knew I needed them only after I discovered where the damn default settings were hidden.
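The points above can be sketched as a configuration with the defaults overridden explicitly. This is an illustration, not the exact configuration that ended up running: the socket path is the Debian one mentioned in the comments, `replication_user` is the account name from the discussion, and the passwords are placeholders. Reusing the replication account as the test user assumes it has been granted read access to the table the RA queries.

```crmsh
# Values in <angle brackets> are placeholders, not taken from the original setup.
primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" \
        socket="/var/run/mysqld/mysqld.sock" \
        replication_user="replication_user" \
        replication_passwd="<replication-password>" \
        test_user="replication_user" \
        test_passwd="<replication-password>" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20
# notify=true was suggested in the comments; clone-max=2 is an assumption for a two-node cluster.
ms ms_mysql p_mysql \
    meta clone-max=2 notify=true
```

After committing a change like this, `crm resource cleanup ms_mysql` resets the fail count so that Pacemaker retries the resource.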

Final word

Again, thanks a lot to @gf_ for taking the time to investigate with me and provide leads in order to debug my setup.

Good HA setups are not that easy to achieve (especially when starting from scratch), but if well configured can be really powerful and provide peace of mind.

Note: peace of mind not guaranteed ;)

+1 for this answer, as for the question! Great you were able to finally sort it out, and I'm feeling very flattered. It's nice to have you on board here at SF - we need more people like you.. :) Cheers! — gxx, Feb 05 '16 at 12:38
Just another small pointer: There are multiple [mailing lists](http://clusterlabs.org/wiki/Mailing_lists), some in active use, some archived by now. Lots of good people on these... — gxx, Feb 05 '16 at 12:44
Thank you @gf_ ! That's really nice of you to say that! Especially when you know it is the first time I have ever set up an HA cluster :) Thanks for the pointer too, it may come in handy in the future, either for me or for someone else stumbling across this question! — Habovh, Feb 05 '16 at 13:12
One last hint: If you're just starting out with HA etc., make sure to inform yourself about "STONITH" / fencing. This is a really really really important core concept, especially if you're one day building clusters with shared storage, but in general as well: Read [this](http://clusterlabs.org/doc/crm_fencing.html), [that](https://ourobengr.com/ha/), have a look [over here](http://advogato.org/person/lmb/diary/105.html) and, if you like comics, check out [this](https://ourobengr.com/stonith-story/). — gxx, Feb 05 '16 at 19:33
Thank you @gf_ ! The links you shared are really worth reading. These are indeed essential concepts of HA that should be taken into consideration. Before reading all of this, I would have counted myself as an *I don't need fencing* kind of guy. Well, I changed my mind! — Habovh, Feb 08 '16 at 11:17