After a power outage at our data centre, the slave MySQL databases are struggling.
This is in the logs for one of the slaves:
100118 10:05:56 [Note] Slave I/O thread: connected to master 'repl@db1:3306', replication started in log 'bin-log.004712' at position 724207814
100118 10:05:56 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position ( server_errno=1236)
100118 10:05:56 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
100118 10:05:56 [Note] Slave I/O thread exiting, read up to log 'bin-log.004712', position 724207814
And the console shows a little more detail:
mysql> show slave status \G;
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: db1
Master_User: repl
Master_Port: 3306
Connect_Retry: 10
Master_Log_File: bin-log.004712
Read_Master_Log_Pos: 724207814
Relay_Log_File: mysqld-relay-bin.000105
Relay_Log_Pos: 98
Relay_Master_Log_File: bin-log.004712
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB: mmplive1,mmpjcr,fui
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 724207814
Relay_Log_Space: 98
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
ERROR:
No query specified
Looking at the bin logs on the master, we have:
-rw-rw---- 1 mysql mysql 724200412 Jan 18 09:22 bin-log.004712
-rw-rw---- 1 mysql mysql 1904 Jan 18 09:27 bin-log.index
-rw-rw---- 1 mysql mysql 5046830 Jan 18 11:22 slow-log
-rw-rw---- 1 mysql mysql 198249613 Jan 18 11:24 bin-log.004713
- The Slave status shows that the Exec_Master_Log_Pos and Read_Master_Log_Pos are both 724207814, for then binary log bin-log.004712. It is my understanding that this value is the byte position in the binary log file.
- That bin-log.004712 file is only 724200412 bytes, so the slaves think that they've done 7402 bytes more work than has actually been persisted in the file (which is on a ext3 fs, RAID-10, RHEL5). Hence the error message about impossible log position, etc.
What's the best way to fix the slaves?
Options I'm considering:
- Setting each slave to point at position 0 in the next bin-log file (bin-log.004713) and let them go, but I'm not sure how safe that is, or how much data might be lost.
- Do I instead need to do a full backup and restore (with presumed associated downtime due to table locks on the InnoDB tables)? I'd like to avoid that if possible.
UPDATE:
I missed off another option: to point each slave executed position back a little bit, so that it would try replicating commands that it had already processed from bin-log.004712.