Let me explain the problem in detail,
We have a Google cloud SQL setup(1 master, 1 failover replica and a read replica). From the last two days, we are facing replication delays on both the replica instances which is constantly increasing.(It's up to 16h as I write).
On pulling the logs from the replica instances, we see slave SQL and slave I/O thread being killed very frequently,
2019-07-15T12:35:19.181804Z 1025650 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.068343' at position 62535096
2019-07-15T12:35:19.184434Z 1025649 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.068473', position 63572825
Please find the relevant slave status info and processlist output below.
show slave status;
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin.068826
Read_Master_Log_Pos: 21806289
Relay_Log_File: relay-log.000025
Relay_Log_Pos: 16457199
Relay_Master_Log_File: mysql-bin.068600
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 52371
Slave_SQL_Running_State: System lock
Master_Retry_Count: 86400
show processlist;
| 1504576 | system user | | NULL | Connect | 3288 | Waiting for master to send event | NULL |
| 1504577 | system user | | NULL | Connect | 52623 | System lock | NULL
Slave related global variables.
| binlog_cache_size | 32768
| binlog_checksum | CRC32
| binlog_direct_non_transactional_updates | OFF
| binlog_error_action | ABORT_SERVER
| binlog_format | ROW
| binlog_group_commit_sync_delay | 0
| binlog_group_commit_sync_no_delay_count | 0
| binlog_gtid_simple_recovery | ON
| binlog_max_flush_queue_time | 0
| binlog_order_commits | ON
| binlog_row_image | FULL
| binlog_rows_query_log_events | OFF
| binlog_stmt_cache_size | 32768
| slave_allow_batching | OFF
| slave_checkpoint_group | 512
| slave_checkpoint_period | 300
| slave_compressed_protocol | OFF
| slave_exec_mode | STRICT
| slave_load_tmpdir | /mysql/tmp
| slave_max_allowed_packet | 1073741824
| slave_net_timeout | 30
| slave_parallel_type | DATABASE
| slave_parallel_workers | 0
| slave_pending_jobs_size_max | 16777216
| slave_preserve_commit_order | OFF
| slave_rows_search_algorithms | TABLE_SCAN,INDEX_SCAN
| slave_skip_errors | OFF
| slave_sql_verify_checksum | ON
| slave_transaction_retries | 10
| slave_type_conversions |
All three instances have 8 vCPU cores, 30GB RAM, and around 550GB SSD Storage, and are cloud SQL MySQL 2nd generation instances(MySQL version 5.7). Master has a very stable CPU usage pattern at around 40% usage, and both the failover and read replica are at around 60% usage.
Does anybody have any idea why the Slave SQL thread is constantly in the "System lock" stage and not proceeding with the replication? Any pointers would be great!