Frozen replication on MySQL 5.7.23 and 5.7.24

Question

We have seen repeatedly this problem on MySQL 5.7.23 and 5.7.24. Replication is frozen on error and I cannot manually restart it using "stop slave; start slave;"

MySQL runs on Debian 9 on VMs on Google compute engine and all packages are up to date. VMs have 4CPUs/26GB RAM. On MySQL replicas we use parallel replication processes, ROW binlog format and LOGICAL_CLOCK for slave-parallel-type

Scenario of our problems:

Replication on read-only replica stops with error 1205.
Error text: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
In bin log I see some normal UPDATE command - we have tons of them during the day.
Check of performance_schema.replication_applier_status_by_worker shows error like this: "Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx; Lock wait timeout exceeded; try restarting transaction"
I start command "stop slave;" from mysql command line tool but command is frozen - processlist shows process | 56327 | root | localhost | NULL | Query | 61716 | Killing slave | stop slave | running indefinitely
manual reboot of the instance from Linux command line does not work. Instance is frozen and I cannot ssh it, I have to force restart from Google GCE web gui.
In error.log I can see sequence of error messages Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxx; Could not execute Update_rows event on table xxxx.xxxx; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx, Error_code: 1205
sequence ends with error message: worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205

I tried to set higher variable slave_transaction_retries (to 30) which lowered number of "frozen cases" but problem still stays. If replication stops I cannot restart it manually from mysql command line tool.

We did not have these problems with frozen replication on 5.7.22 or older releases. Although from time to time we had errors 1205 in replication due to huge amount of UPDATEs we have during the day, manual restart of replication from mysql command line tool always worked without problems.

Situation seems to be a bit better on 5.7.24 which came with many repairs in replication. On 24 we see much less cases of this problem but it is still there.

Can I influence this behavior by some parameter?
What would you recommend to check if this problem happens again?
Can I force restart of frozen replication without restarting MySQL?

Thank you very much for any idea or help.

Frozen replication on MySQL 5.7.23 and 5.7.24

0 Answers0