
I'm running a three-node Galera Cluster on MariaDB. The application is written in PHP using the mysqli extension.

Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when it happens.

The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error, and once `max_connect_errors` is exceeded it starts refusing connections permanently, requiring a manual server restart.

What I don't get is why the other nodes go down too. They both start erroring with "WSREP has not yet prepared node for application use", which means the entire application fails with no database node accepting connections.

How can I ensure that the rest of the cluster stays up when one node suffers one of these admittedly rare deadlocks?


Update:

A month later, another deadlock has caused a similar problem. Again, one node brings down everything.

The first connection gets a deadlock (at the commit phase), so the application tries to replay the transaction. The replay hangs for almost a minute and then fails again.

After the first connection fails to recover, all other connections start failing with error 1205, "Lock wait timeout exceeded", rendering the entire cluster useless.

I should add that the application does not take explicit locks. However it has managed to tie itself in a knot, it has done so using only regular transactional queries.
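To be concrete, the failures surface at commit time roughly like this; a minimal sketch with placeholder connection details and query, assuming mysqli's traditional (non-exception) error reporting:

```php
<?php
// Sketch of where the failure surfaces (connection details and query
// are placeholders, not the real application).
$db = new mysqli('db-node-1', 'app_user', 'secret', 'app_db');

$db->begin_transaction();
$db->query("UPDATE accounts SET balance = balance - 10 WHERE id = 1");

if (!$db->commit()) {
    // The errors seen in this question, by code:
    //   1213 = ER_LOCK_DEADLOCK     ("Deadlock found when trying to get lock")
    //   1205 = ER_LOCK_WAIT_TIMEOUT ("Lock wait timeout exceeded")
    //   2006 = CR_SERVER_GONE_ERROR ("MySQL server has gone away")
    error_log("commit failed: ({$db->errno}) {$db->error}");
    $db->rollback();
}
```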

  • Get some more insight into the alleged "crash". It sounds more like a hang or a mis-configured timeout. Let's see your my.cnf. Do _not_ run your application as the 'root' MySQL user; then there will be a spare connection for you to get in. When you get in, do `SHOW PROCESSLIST;` and other diagnostics. – Rick James Jul 01 '17 at 02:06
  • Updated with a similar crash. my.cnf is here: https://pastebin.com/raw/1e2bcrks – Tim Aug 02 '17 at 14:23
  • Out of interest is this something you've only experienced with MariaDB, as opposed to native MySQL? Because this notion that they're somehow interchangeable doesn't seem to be true. We've experienced MariaDB producing different results to MySQL from the same queries and dataset and various other inconsistencies between the two. Granted that's a totally different problem but I'd be interested to know if this is something that's specific to MariaDB. – Andy Aug 02 '17 at 14:25
  • I transitioned to MariaDB at the same time as implementing Galera cluster. i.e. I've never clustered native MySQL – Tim Aug 02 '17 at 14:33
  • I am getting the exact same problem as @Tim described. Did you manage to solve it? – Aistis Mar 22 '18 at 23:54
  • I seem to have solved it, but I'm unsure how. I think (and it was a while ago) that my code was retrying all failed transactions, including those that timed out. I switched to retrying ONLY deadlocks (code 1213) and have not had this problem since. – Tim Mar 23 '18 at 10:56
  • Looked through my Git history and I think I can be more precise. My retry code was retrying up to `n` times for the deadlock to lift, but if the retried transaction failed for a *different* reason (like 1205) it would continue to retry. I changed that so only a repeated deadlock would continue the attempts; anything else would give up and quit. – Tim Mar 23 '18 at 11:10

1 Answer


I'm answering my own question as I've managed to avoid crashes. However, I still have problems with secondary errors and have started a new thread with the specifics.

My recovery code now handles secondary errors differently. It will retry deadlocks a couple of times, but only while the error is still a deadlock. If any other type of error occurs, the application gives up.

Although this means occasionally disappointing users with errors, I haven't had a cluster crash since this change, and I haven't seen the dreaded "server has gone away" error.
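For illustration, here is a minimal sketch of that retry policy (the function name, retry limit, and mysqli's traditional non-exception error mode are illustrative assumptions, not my exact production code):

```php
<?php
// Sketch of the retry policy: replay the transaction only while the
// failure is a deadlock (1213); any other error gives up immediately.
function run_with_deadlock_retry(mysqli $db, callable $txn, $maxRetries = 3)
{
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        $db->begin_transaction();
        // $txn runs the application's queries and returns true on success.
        if ($txn($db) && $db->commit()) {
            return true;                   // committed successfully
        }
        $errno = $db->errno;               // capture before rollback() resets it
        $db->rollback();
        if ($errno !== 1213) {             // 1213 = ER_LOCK_DEADLOCK
            return false;                  // e.g. 1205: give up, do not retry
        }
        // Only a repeated deadlock continues the loop.
    }
    return false;                          // deadlock never lifted; give up
}
```

The transaction body is passed in as a callable so that the whole unit of work can be replayed from the top on each attempt.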
