
Let me explain the problem in detail:

We have a Google Cloud SQL setup (1 master, 1 failover replica, and a read replica). For the last two days, we have been facing replication delays on both replica instances, and the lag keeps growing (it's up to 16h as I write).

On pulling the logs from the replica instances, we see the slave SQL and slave I/O threads being killed very frequently:

2019-07-15T12:35:19.181804Z 1025650 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.068343' at position 62535096
2019-07-15T12:35:19.184434Z 1025649 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.068473', position 63572825

Please find the relevant slave status info and processlist output below.

show slave status;

         Slave_IO_State: Waiting for master to send event
         Master_Log_File: mysql-bin.068826
         Read_Master_Log_Pos: 21806289
         Relay_Log_File: relay-log.000025
         Relay_Log_Pos: 16457199
         Relay_Master_Log_File: mysql-bin.068600
         Slave_IO_Running: Yes
         Slave_SQL_Running: Yes
         Seconds_Behind_Master: 52371
         Slave_SQL_Running_State: System lock
         Master_Retry_Count: 86400
show processlist;

| 1504576 | system user         | | NULL            | Connect |  3288 | Waiting for master to send event | NULL                                                                                                 |
| 1504577 | system user         | | NULL            | Connect | 52623 | System lock                      | NULL                                                                                                 |

Slave-related global variables:

| binlog_cache_size                                        | 32768
| binlog_checksum                                          | CRC32
| binlog_direct_non_transactional_updates                  | OFF
| binlog_error_action                                      | ABORT_SERVER
| binlog_format                                            | ROW
| binlog_group_commit_sync_delay                           | 0
| binlog_group_commit_sync_no_delay_count                  | 0
| binlog_gtid_simple_recovery                              | ON
| binlog_max_flush_queue_time                              | 0
| binlog_order_commits                                     | ON
| binlog_row_image                                         | FULL
| binlog_rows_query_log_events                             | OFF
| binlog_stmt_cache_size                                   | 32768
| slave_allow_batching                                     | OFF
| slave_checkpoint_group                                   | 512
| slave_checkpoint_period                                  | 300
| slave_compressed_protocol                                | OFF
| slave_exec_mode                                          | STRICT
| slave_load_tmpdir                                        | /mysql/tmp
| slave_max_allowed_packet                                 | 1073741824
| slave_net_timeout                                        | 30
| slave_parallel_type                                      | DATABASE
| slave_parallel_workers                                   | 0
| slave_pending_jobs_size_max                              | 16777216
| slave_preserve_commit_order                              | OFF
| slave_rows_search_algorithms                             | TABLE_SCAN,INDEX_SCAN
| slave_skip_errors                                        | OFF
| slave_sql_verify_checksum                                | ON
| slave_transaction_retries                                | 10
| slave_type_conversions                                   |
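
Note from the list above that slave_parallel_workers is 0, i.e. the replicas apply the binlog with a single SQL thread. For reference, on a self-managed MySQL 5.7 replica, multi-threaded apply would be enabled roughly as below (on managed Cloud SQL these would have to be set through database flags, if exposed at all); this is a sketch of the general technique, not something we have applied, and the worker count is just an example:

    -- Stop only the applier; the I/O thread keeps pulling binlogs from the master.
    STOP SLAVE SQL_THREAD;
    -- LOGICAL_CLOCK parallelizes by group commit on the master instead of by database.
    SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
    SET GLOBAL slave_parallel_workers = 4;  -- example value, tune to workload
    START SLAVE SQL_THREAD;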

All three instances have 8 vCPU cores, 30 GB RAM, and around 550 GB SSD storage, and are Cloud SQL MySQL second-generation instances (MySQL version 5.7). The master has a very stable CPU usage pattern at around 40%, and both the failover and read replicas sit at around 60%.

Does anybody have any idea why the Slave SQL thread is constantly in the "System lock" stage and not proceeding with the replication? Any pointers would be great!
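
In case it helps with diagnosis, the Performance Schema replication tables in 5.7 expose more applier detail than SHOW SLAVE STATUS; this is a generic diagnostic sketch, not output from our instances:

    -- Per-thread applier state, including the last applied transaction:
    SELECT * FROM performance_schema.replication_applier_status_by_worker\G
    -- I/O thread connection state and the GTID set received so far:
    SELECT * FROM performance_schema.replication_connection_status\G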

  • Hmmm... 40% CPU is high. Do you have thousands of queries/second? Lots of slow queries? – Rick James Jul 16 '19 at 19:42
  • Is replication using GTIDs? – Rick James Jul 16 '19 at 19:44
  • Do you use MyISAM (I hope not)? When it happens, quickly grab `SHOW FULL PROCESSLIST;` - this may give some clues. Do you use `LOCK TABLES` (I hope not)? – Rick James Jul 16 '19 at 19:46
  • @RickJames Yeah, replication is using GTIDs. The engine is InnoDB, not MyISAM, and we don't use `LOCK TABLES`. – techiewickie Jul 18 '19 at 05:59
  • @RickJames We have roughly 5K queries/second on the master. We had quite a number of slow queries before, but we fixed most of them last week. We'll try resharding so that we can distribute the queries. – techiewickie Jul 18 '19 at 06:28

1 Answer


Google Cloud Platform Support here!

Could you please confirm the QPS ratio on your master instance? If the master gets overloaded with queries and CPU usage goes beyond what it can process, row-based replication helps ease the load, but because replication is single-threaded the replica gets blocked.
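
To measure the QPS yourself, you can sample the master's global status counters twice and divide the delta by the interval; a minimal sketch using standard counters:

    -- Run on the master, twice, N seconds apart; (delta of Questions) / N ≈ QPS.
    SHOW GLOBAL STATUS LIKE 'Questions';
    -- The write/read mix can be estimated from the per-statement counters:
    SHOW GLOBAL STATUS WHERE Variable_name IN
        ('Com_select', 'Com_insert', 'Com_update', 'Com_delete');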

One good practice would be resharding your master instance into two or more instances to split the query load. You can refer to more best practices here.

To assist you further, please create a case at the public issue tracker via the following link. That way, we will be able to investigate with our internal tools and provide you with more meaningful information.

bhito
  • I just checked the QPS ratio and it's roughly 5:1 writes to reads. It could be the master getting overloaded with queries, as you suggested. I'll try and see if we can proceed with a resharding approach to ease the load. I'll create a ticket on the issue tracker anyway. – techiewickie Jul 18 '19 at 06:20