What's causing subsequent errors when restarting a deadlocked transaction?

Question

When I restart a transaction that failed at the commit stage, the restart itself fails with a second error. This is running Galera Cluster under MariaDB 10.2.6.

The sequence of events goes like this:

1. Commit a transaction (say, a single insert).
2. COMMIT fails with error 1213 "Deadlock found when trying to get lock".
3. Begin a new transaction to replay the SQL statement[s].
4. BEGIN fails with error 1047 "WSREP has not yet prepared node for application use".
5. My application bails to avoid a more serious crash (see notes below).

This happens quite regularly and although the cluster recovers, individual threads receive failures. Yesterday this happened 15 times in one second.

I cannot identify any root cause for this. The deadlock seems to be what initiates the problem. The situation should be recoverable (and often is), but with multiple clients all trying to resolve their deadlocks at the same time, the whole thing seems to just fail.

Notes:

This is related to an earlier question where retrying failed transactions caused a total crash of the cluster. I've managed to prevent crashes by retrying transactions only on deadlocks; i.e., if a different type of error occurs during a restart, the application gives up.

I'm aware that 10.2.6 is not the latest version of MariaDB. I'm nervous about upgrading right now, as I've had such bad experiences. I would like to understand the current problem before doing an upgrade, and I've been unable to reproduce the errors in a test environment.

score 0 · Answer 1 · answered May 01 '18 at 16:17

I'm not sure, but I suspect 3 tries (not 2) is appropriate. Committing involves two steps:

Checking for a deadlock purely within the node you are connected to (e.g., another query is touching the same row or gap).
Checking with the other nodes to see if they will complain (e.g., the same row has already been inserted on another node).

Sure, either of those could happen repeatedly, and in any order. But making 3 tries seems reasonable.
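To make the "3 tries" suggestion concrete, here is a minimal sketch of a bounded retry loop that retries only on deadlock (error 1213) and gives up immediately on anything else, such as error 1047. The names `DbError`, `run_with_retries`, and `run_transaction` are hypothetical; a real driver raises its own exception type and exposes the error code in its own way, so treat this as an illustration rather than a drop-in implementation.

```python
ER_LOCK_DEADLOCK = 1213  # "Deadlock found when trying to get lock"


class DbError(Exception):
    """Hypothetical stand-in for a driver exception that carries a MariaDB error code."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_retries(run_transaction, max_tries=3):
    """Call run_transaction() up to max_tries times.

    run_transaction is assumed to execute BEGIN, the statements, and COMMIT,
    raising DbError on failure. Only deadlocks are retried; any other error,
    or a deadlock on the final attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return run_transaction()
        except DbError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt == max_tries:
                raise
```

With `max_tries=3`, a transaction that deadlocks twice and then commits succeeds on the third call, while a 1047 error on any attempt is passed straight up so the application can bail.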

Now, once you have failed "too many" times, it is right to abort and get a human (a DBA type) involved. I suspect that you could restructure your code, application logic, etc. in some way to avoid most of the failures. Would you like to provide more details, so we can discuss that possibility?

What kind of table? (Queue, transactions, logging, etc)
SHOW CREATE TABLE. (auto_inc, unique keys, etc; too many UNIQUE keys can aggravate the situation)
What does the INSERT look like?
How often do you run inserts like this one? How often does it fail? (Instrument your code so you count even those that you can recover from.)
How spread out is the Cluster? (ping time)
What other queries are hitting the table? (They may be aggravating the issue.)
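For the instrumentation question above, a rough sketch of counting every attempt, including the ones that are later recovered by a retry, might look like this. The names `stats`, `record_attempt`, and `failure_rate` are made up for illustration, and the in-process counter is an assumption; a real application would more likely write to a log or a metrics system.

```python
from collections import Counter

# Process-local tallies; keys are "attempts" plus one "error_<code>" key per error code seen.
stats = Counter()


def record_attempt(errno=None):
    """Record one COMMIT attempt; pass the MariaDB error code if the attempt failed."""
    stats["attempts"] += 1
    if errno is not None:
        stats[f"error_{errno}"] += 1


def failure_rate():
    """Return the fraction of recorded attempts that failed with any error code."""
    if stats["attempts"] == 0:
        return 0.0
    failures = sum(n for key, n in stats.items() if key.startswith("error_"))
    return failures / stats["attempts"]
```

Calling `record_attempt(1213)` on each deadlock and `record_attempt()` on each successful commit gives a per-error breakdown that answers both "how often do you run inserts like this one" and "how often does it fail".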

Thanks for your responses. Many failures turned out to be simultaneous writes to the same PHP session. The application now avoids redundant session shutdowns and has reduced deadlocks significantly. I will take your advice and increase my retries from 2 to 3. — Tim, May 01 '18 at 18:21
