
I've set up MySQL GTID replication between a master and a slave. The strange thing is that replication stops working after several minutes, and I have to run STOP SLAVE and START SLAVE to get it going again. Can anyone tell me what causes this problem?

The CHANGE MASTER TO statement run on the slave:

mysql> change master to
                -> master_host = 'master.com',
                -> master_user = 'replica',
                -> master_password = 'password',
                -> master_port = 3306,
                -> MASTER_CONNECT_RETRY = 5,
                -> MASTER_RETRY_COUNT = 0,
                -> MASTER_AUTO_POSITION=1;

Master config file:

[mysqld]
user        = mysql
pid-file    = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port        = 3306
basedir     = /usr
datadir         = /data/mysql_data
tmpdir      = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking

binlog-format   = MIXED

interactive_timeout=180
wait_timeout=180

key_buffer      = 16M
max_allowed_packet  = 16M
thread_stack        = 192K
thread_cache_size       = 8

myisam-recover         = BACKUP
max_connections        = 300

query_cache_limit   = 1M
query_cache_size        = 16M

general_log             = 1
log_error = /var/log/mysql/error.log
server-id       = 1
log_bin         = /var/log/mysql/mysql-bin.log
log_bin_trust_function_creators = 1
log-slave-updates   = true

# enable GTID
gtid-mode = on
enforce-gtid-consistency = true
master-info-repository=TABLE
relay-log-info-repository=TABLE
sync-master-info=1
binlog-checksum=CRC32
master-verify-checksum=1

expire_logs_days    = 10
max_binlog_size     = 100M

Slave config:

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
basedir         = /usr
datadir         = /data/mysql_data
tmpdir         = /data/mysql_data/tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking

binlog-format   = MIXED

interactive_timeout=180
wait_timeout=180

key_buffer              = 16M
max_allowed_packet      = 16M
thread_stack            = 192K
thread_cache_size       = 8
myisam-recover         = BACKUP
max_connections        = 100

query_cache_limit       = 1M
query_cache_size        = 16M

general_log             = 1
log_error = /var/log/mysql/error.log
server-id               = 2

log_bin                 = /var/log/mysql/mysql-bin.log
log_bin_trust_function_creators = 1
log-slave-updates       = true

# enable GTID
gtid-mode = on
enforce-gtid-consistency = true
sync-master-info=1
binlog-checksum=CRC32
master-verify-checksum=1
slave-sql-verify-checksum=1
binlog-rows-query-log_events=1

expire_logs_days        = 10
max_binlog_size         = 100M

I don't see any problems in SHOW SLAVE STATUS, but the issue keeps coming back. Any help would be appreciated.

technoob

1 Answer

SET GLOBAL SLAVE_NET_TIMEOUT = 60;
STOP SLAVE;
START SLAVE;

You are right to be skeptical that this will resolve the issue, since no timeout seems to be occurring... nor do you want one to occur, but this should still be the solution. I'll explain.

When replication seems to stall with no errors (Slave_IO_Running: Yes, Slave_SQL_Running: Yes, Seconds_Behind_Master: 0), this implies a hung replication connection. The slave thinks it's connected, and thinks that no new events have arrived.
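
For reference, the stalled state described above looks something like this in SHOW SLAVE STATUS\G output (abridged; the GTID set shown is a placeholder). A useful cross-check, not mentioned in the question, is to compare the slave's Retrieved_Gtid_Set against gtid_executed on the master: if the master's set keeps growing while the slave's retrieved set stands still, events are being written but never arriving.

mysql> SHOW SLAVE STATUS\G
...
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
    Seconds_Behind_Master: 0
       Retrieved_Gtid_Set: <master_server_uuid>:1-1000
...

-- and on the master, for comparison:
mysql> SELECT @@GLOBAL.gtid_executed;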

In MySQL native asynchronous replication, the slave is responsible for initiating the connection to the master, and then its role becomes passive -- as replication events occur, the master autonomously pushes them to the slave over that connection, and the slave, at layer 7, does not do anything in response. TCP does, of course, but neither master nor slave is aware of this. Until a replication event occurs, the connection is simply idle, with no interaction taking place. As long as neither side sees anything like a TCP FIN or RST closing the connection, the connection is assumed to be up.

This breaks down in low traffic periods, if master and slave are connected over any equipment that handles TCP connections in a stateful way -- firewalls, NAT devices, EC2 security groups -- because statefulness usually implies timeout timers. If a connection is idle for too long, the "network" (general term I'll use for things connecting things to other things) will evict the connection from its state tables -- the connection is "forgotten." Fifteen minutes is a commonly-encountered value.

When such a timeout occurs, the network commonly does nothing other than remove the connection from its internal memory structures. Nothing typically happens over the wire. The parties to the connection are assumed to have abandoned it, or the traffic is assumed to have moved to another path, so the device purging the connection from its memory -- correctly -- does not actively try to advise the endpoints that the connection is no longer going to be viable.

Then, the next time the master sends an event, after this timeout has lapsed, the network will probably respond by resetting this "unknown" connection in the master's direction, but not in the slave's direction, because the master is the one that initiated the packet that was part of an "unknown" connection. So the slave thinks it has a connection, when in fact there's nothing on the other end of the pipe.

Setting slave_net_timeout solves this in an obvious and a non-obvious way. The non-obvious one is the one we're particularly interested in, while the obvious one becomes our fallback.

When a slave connects to a master, it asks the master to send heartbeat messages. Heartbeats are dummy replication events that do not actually get written to the master's binlog or the slave's relay log. They're generated only when no real replication event has occurred for MASTER_HEARTBEAT_PERIOD seconds.

MASTER_HEARTBEAT_PERIOD, if not explicitly set with CHANGE MASTER TO, defaults to slave_net_timeout / 2.
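
You can check the heartbeat interval currently in effect on the slave; this is a quick sketch assuming the Slave_heartbeat_period status variable available in stock MySQL 5.6/5.7:

SHOW GLOBAL STATUS LIKE 'Slave_heartbeat_period';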

So, the non-obvious contribution of setting slave_net_timeout is that the master will now actively send traffic to keep an otherwise-idle connection alive, every 30 seconds (60 / 2). The fallback is that after 60 seconds of nothing at all, the slave will automatically drop the connection and reconnect to the master -- effectively the same thing you are doing by stopping and starting the slave -- though this should never happen if the connection is intact, because the master will be sending those heartbeats as needed.
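
To confirm the keepalives are actually flowing after the change, you can watch the heartbeat counter during a quiet period (again assuming the stock 5.6/5.7 status variables); with a 30-second period it should climb by roughly two per minute while no real events are replicating:

SHOW GLOBAL STATUS LIKE 'Slave_received_heartbeats';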

If this solves your issue, remember that you also need to make the change to slave_net_timeout persistent by updating my.cnf and restarting the server -- otherwise the setting will revert the next time the server restarts, and the default value before MySQL 5.7 is 3600.
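
For example, in the [mysqld] section of the slave's my.cnf (the 60-second value matches the SET GLOBAL above; use whatever value you settle on):

[mysqld]
slave_net_timeout = 60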

You could, alternatively, simply change MASTER_HEARTBEAT_PERIOD to a smaller value, but this only fixes half the problem: when the connection really fails, the slave may take an excessive amount of time to notice it.
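
If you do decide to set the heartbeat explicitly, it is an option of CHANGE MASTER TO, so the slave threads must be stopped first; the 15-second value here is purely illustrative, and the other connection settings are left untouched:

STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 15;
START SLAVE;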


Unrelated: note that MASTER_CONNECT_RETRY = 5 is far too low. You want this much higher, or the slave may give up on the master far too quickly during an outage.
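
A sketch of raising those settings back toward the server defaults (a 60-second retry interval and the 5.6 default retry count of 86400 -- adjust both to taste):

STOP SLAVE;
CHANGE MASTER TO
    MASTER_CONNECT_RETRY = 60,
    MASTER_RETRY_COUNT = 86400;
START SLAVE;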

Michael - sqlbot