0

We're running a MySQL master-slave replication set-up and the slave is very slow at catching up. There seems to be no high network or cpu usage while the slave is going through the masters' logs. Connection between master and slave is stable and fast (15 MB/s avg.) and the slaves' cpu-usage does not exceed 2 % on every core (quad core cpu).

We've done some testing by letting the slave catch up until its on the same position as the master and then executing a lot of queries. While the masters' cpu-usage goes way up, the slave is actually doing a very poor job at getting the queries from the master. The Relay_Log_Pos is increasing at a very slow tempo and Exec_Master_Log_Pos is very close to the Relay_Log_pos.

To me it seems to be networking related as the Relay_Log_Pos does not increase very fast, causing the slave to not have many queries to execute, or am I missing something here?

Qlii256
  • 151
  • 2
  • 8
  • IO contention? How much time are the slave's CPU cores spending waiting for IO? – EEAA Dec 31 '15 at 23:45
  • iowait has an average of 46.94 % and a maximum of 96.32 % while the slave is working. I also have to mention that these values are measured by Munin and the maximum cpu is 400 % (96 % is not full cpu usage of course). – Qlii256 Dec 31 '15 at 23:52
  • 1
    Well, you've found your problem. If this behavior is problematic for you, you'll need to improve the slave's storage subsystem. – EEAA Dec 31 '15 at 23:52
  • Thanks for you comment, was already afraid of that. Is there anything I could do to improve this? – Qlii256 Dec 31 '15 at 23:53
  • Is the slave fielding reads from your application? – EEAA Dec 31 '15 at 23:55
  • 2
    Since this is a slave, `SET GLOBAL innodb_flush_log_at_trx_commit = 2` should be a reasonable option on the slave... have you tried that? It can definitely be the difference between a slave that can't keep up, and one that can, when I/O blocking is an issue. http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit – Michael - sqlbot Jan 01 '16 at 04:22
  • agree on innodb_flush_log_at_trx_commit=2, you'd also have to check your disk IO scheduler and lastly try experimenting with multi-threaded slave replication. – jerichorivera Jan 01 '16 at 04:50
  • Will defenetly try that. Both master and slave are being used in the development environment and are not the final products used for production. It will however be nice to make the slave a bit faster. – Qlii256 Jan 01 '16 at 08:54
  • I see that setting innodb_flush_log_at_trx_commit=2 could possibly result in data loss up to one second, but as it's a development-server, it's not really of an issue (as I also have cache=writeback on that machine). – Qlii256 Jan 01 '16 at 09:00
  • I've just ran a test with the setting innodb_flush_log_at_trx_commit on 2, but the slave succeeds in about the exact same time and the iowait is on an average of 44.82 % and a maximum of 89.96 %, which is the same as before. I've verified that the setting is indeed correctly set with "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';" which returns 2. I've also restarted the server to make sure the same session is not used. Any other ideas what I could try? – Qlii256 Jan 01 '16 at 09:47
  • By restarting the mysql-server, the setting (variable) was somehow reset (as this is normal). I've now added this variable inside the my.cnf and guess what, it's going super fast now! It's done in just a fraction of a second, where before it took 6 minutes. Is this normal behaviour? A checksum on the tables reveals that they are equal on master and slave. I did some reading on this particular variable but could not find a good explanation of why it's so much faster. – Qlii256 Jan 01 '16 at 10:18
  • Did some further reading and it seems that in fact, now it is pushing every commit to the log file. Every second (the docs state it's not always exact one second), it will then be written to the disk. When having the default value, it would write it every commit to disk, causing huge iowait. This is because the slave can run more then one query per second, resulting in writing many queries at once instead of writing each query separately. The security risk here is that when the server crashes before it got to write, that data is lost (approx. 1 second of data). – Qlii256 Jan 01 '16 at 11:45
  • 1
    read about crash safe slaves, but in all honesty as long as you have the master you can always rebuild slave or resync slave with pt-table-checksum/pt-table-sync – jerichorivera Jan 01 '16 at 12:32
  • I've enabled storing the log position to a table rather then a file making it more secure (crash safe) now. Thank you for notifying that! The slave is going blazingly fast now and can keep up very well with the master. It's just small seconds behind the master and is not going over high cpu usage (was caused by the iowait). – Qlii256 Jan 01 '16 at 13:46
  • @Michael-sqlbot Can you post it as an answer, as I would like to accept it. It's working great with this fix, thank you again! – Qlii256 Jan 02 '16 at 09:22

1 Answers1

1

45%-96% iowait is quite high.

You'll need to improve your disk write speeds somehow. You could purchase faster disk or upgrade to a RAID controller that has a larger amount of write cache.

EEAA
  • 109,363
  • 18
  • 175
  • 245
  • I was afraid of that. I've tried enable cache=writeback (the slave is a qemu-kvm virtual machine), but that does not really speed things up. I guess the host will need ssl drives or just faster disks overall... – Qlii256 Jan 01 '16 at 01:03