2

I have encountered an issue with my replication setup. Under normal circumstances it performs as intended, but it breaks if it encounters a hard reboot (I pull the power cable).

Setup

The setup consists of two servers, server-1 and server-2. The replication setup is a circular replication. Both servers are slaves replicating from the other host as if it were the master.

Software

  • Red Hat 6.5
  • MySQL 5.1.73-3
  • MySQL-libs 5.1.73-3
  • MySQL-server 5.1.73-3

Redundancy configuration

The replication configuration is as follows for server-1, and is mirrored for server-2:

GRANT REPLICATION SLAVE ON *.* TO 'replicant';
FLUSH TABLES WITH READ LOCK;"

#Use "SHOW MASTER STATUS \G;" to determine log position and log file

UNLOCK TABLES;

# set status on slave server using the values retieved earlier
CHANGE MASTER TO MASTER_HOST='server-2', MASTER_USER='replicant', MASTER_LOG_FILE='[log file]', MASTER_LOG_POS=[log pos];"

START SLAVE;

Error information

This information is retrieved when the server-2 has encountered a hard reboot and the replication is broken.

server-1 slave status

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: server-2
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000004
          Read_Master_Log_Pos: 37550
               Relay_Log_File: relay.000427
                Relay_Log_Pos: 245
        Relay_Master_Log_File: bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 37550
              Relay_Log_Space: 529
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 

server-2 slave status

*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: server-1
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000002
          Read_Master_Log_Pos: 7208
               Relay_Log_File: relay.000010
                Relay_Log_Pos: 245
        Relay_Master_Log_File: bin.000002
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 7208
              Relay_Log_Space: 529
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 

Relevant information from server-2 mysql-log-error.log

These logs are from the boot that caused the replication to fail

150821 12:52:35 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150821 12:52:35  InnoDB: Initializing buffer pool, size = 8.0M
150821 12:52:35  InnoDB: Completed initialization of buffer pool
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
150821 12:52:35  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Last MySQL binlog file position 0 106, file name /var/lib/mysql/bin.000002
150821 12:52:35  InnoDB: Started; log sequence number 0 259987
150821 12:52:35 [Note] Recovering after a crash using /var/lib/mysql/bin
150821 12:52:35 [Note] Starting crash recovery...
150821 12:52:35 [Note] Crash recovery finished.
150821 12:52:35 [ERROR] log  listed in the index, but failed to stat
150821 12:52:35 [ERROR] Error counting relay log space
150821 12:52:35 [ERROR] Failed to initialize the master info structure
150821 12:52:35 [Note] Event Scheduler: Loaded 0 events
150821 12:52:35 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution  

Replicant access

Among the things that I attempted in order to gather information and resolve the issue, was to attempt to access server-1 from server-2 on behalf of the replicant user. That attempt was not successful, and I got an "Access denied" message.

I therefore looked into grants on server-2, and got the following information.

SHOW GRANTS FOR "replicant"@"server-1";
ERROR 1141 (42000) at line 1: There is no such grant defined for user 'replicant' on host 'server-1;'

My attempts at finding a resolution

My attempts to solve this problem have so far been fruitless. No matter what I try, I always get the same answer from MySQL.

ERROR 1201 (HY000) at line 1: Could not initialize master info structure; more error messages can be found in the MySQL error log

Any configuration, FLUSH, RESET, or log manipulation that I have tried has resulted in this error message when I then attempt to either "START SLAVE;" or "CHANGE MASTER TO ...", to get the replication going again.

Duplicating server-1

Making server-2 a copy of server-1 is promising. I get server-2 working, but server-1 is unable to replicate from server-2. I copy the server using the following steps.

  1. stop the mysqld service on server-2
  2. Clear /var/lib/mysql/ on server-2
  3. rsync /var/lib/mysql/ from server-1 to server-2
  4. change master host on server-2 to be server-1
  5. start the mysqld service on server-2

It looks like this does the trick server-2, but server-1 is unable to replicate from server-2. I see the following in the slave status of server-1:

  Slave_IO_state: Waiting to reconnect after a failed master even read
Slave_IO_Running: No

I find this in the error logs on server-1

[Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'FIRST' at position 4
[ERROR] Error reading packet from server: Access denied; you need the REPLICATION SLAVE privilege for this operation ( server_errno=1227)

I have checked the grants for the replicant user on both servers, and they have not changed.

mysql> SHOW GRANTS FOR 'replicant';
+---------------------------------------------------+
| Grants for replicant@%                            |
+---------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO 'replicant'@'%' |
+---------------------------------------------------+
1 row in set (0.00 sec)

It seems like server-1 is having problems accessing the logs on server-2. I have attempted the following to fix this problem.

  • RESET MASTER; + RESET SLAVE; on both servers, and then setting up the slaves again using CHANGE MASTER TO ...;.
  • Changing the MASTER_LOG_FILE and MASTER_LOG_POS on server-1 before the rsync to match the master status of server-1.
  • Trying every permutation of MASTER_LOG_FILE and MASTER_LOG_POS manipulation that I can think of.

1 Answers1

1

I continued working on the problem, and put together a solution that worked for me.

Solution

This solution assumes that server-2 is the server with the broken replication. This solution needs to be mirrored if server-1 is the server that is broken.

  1. Disable incoming connections on both servers
    • SHOW FULL PROCESSLIST; should only show a localhost connection with "SHOW FULL PROCESS LIST" state, and perhaps some system user entries
    • One way of achieving this is to assign a new port to the mysql service that is not the default 3306, or if you are specifying a port, change it to something else.
      • E.g. add "port=33060" to the [mysqld] section in /etc/my.cnf
  2. Prevent the replication slave process from starting when the server starts, on both servers
    1. Add "skip-slave-start" to the [mysqld] section in /etc/my.cnf
    2. Restart the mysql server
      • service mysqld restart
  3. Reset slave and master configurations on server-1
    • RESET SLAVE;
    • RESET MASTER;
  4. Create a copy of the database on server-1, and copy it to server-2.
    1. FLUSH TABLES WITH READ LOCK;
    2. mysqldump --databases [replicated database] ... > ./dump.sql
    3. UNLOCK TABLES;
    4. scp ./dump.sql [user]@server-2:/home/[user]/
  5. Import the data into the database on server-2
    • As root, mysql < /home/[user]/dump.sql
  6. Reset slave and master configurations on server-2
    • RESET SLAVE;
    • RESET MASTER;
  7. Configure the replication slave on both servers
    • On server-1: CHANGE MASTER TO MASTER_HOST='server-2', MASTER_USER='replicant', MASTER_LOG_FILE="bin.000001", MASTER_LOG_POS=1;
    • On server-2: CHANGE MASTER TO MASTER_HOST='server-1', MASTER_USER='replicant', MASTER_LOG_FILE="bin.000001", MASTER_LOG_POS=1;
  8. If you receive the following error on server-2: ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
    • Remove the relay mysql relay files, and attempt steps 6 and 7 again.
    • rm -rf /var/lib/relay*
  9. Enable incoming connections on both servers
    • Revert step 1
    • Restart the mysql server
      • service mysqld restart
  10. Start replication slaves on both servers
    • START SLAVE;
  11. Replication should now be working. Verify its health on both servers
    • SHOW SLAVE STATUS\G;
      • Slave_IO_State should not indicate an error
      • Slave_IO_Running and Slave_SQL_Running should both be in Yes state
  12. Clean up
    • Remove skip-slave-start entry from /etc/my.cnf on both servers
      • It is not necessary to restart the mysql server
    • Delete the dump.sql file on both servers