MySQL Slave Health Check

Question

I am working on a project like creating database config on fly , got an array of read replicas for mysql instance and keeping one open connection to each and keeping their statics in that service so when a client tries to connect a read replica than its suppose to return less busy one.

My questions are what should be the formula for it ?

I have only 2 variables so far , any improvements to that variables are welcome as well.

Is Remote Server is alive
How many active connections it has with Threads_connected
Is replication healthy

I'd be curious if you had already looked at something like this and what the pros/cons were versus dynamic client setup: http://forge.mysql.com/wiki/MySQL_Proxy — polynomial, Aug 24 '11 at 02:31
They dont suggest yo use mysql_proxy on production server, my code will return a hostname from array basicly, wont be forward any queries. I am hoping to improve dns resolition errors and abit more detailed load balancing and also some failover features. — Kevin Lee, Aug 24 '11 at 02:48

score 1 · Answer 1 · answered Aug 24 '11 at 02:26

You can do it with simple script or use Nagios plugins.

check_ping or check_icmp
check_mysql_health, something like this:

    define command{
        command_name    check_mysql_health
        command_line    $USER1$/check_mysql_health -t 20 --hostname $HOSTADDRESS$ --port $ARG1$ --username $ARG2$ --password $ARG3$ --mode $ARG4$ --warning $ARG5$ --critical $ARG6$
    }

    define service{
            use                     generic-service
            host_name               mysql_slave
            service_description     MySQL_threads-connected
            check_command           check_mysql_health!3306!user!password!threads-connected!30!40
    }

3.


    define service{
            use                     critical-service
            host_name               mysql_slave
            service_description     MySQL_slave-io-running
            check_command           check_mysql_health!3307!user!password!slave-io-running
            contact_groups          admin-sms
    }

    define service{
            use                     critical-service
            host_name               mysql_slave
            service_description     MySQL_slave-sql-running
            check_command           check_mysql_health!3307!user!password!slave-sql-running
            contact_groups          admin-sms
    }

Actually using nagios already but its for our applications, our application will request a healthy and less busy database server and that service will return it, so need more detailed measures. — Kevin Lee, Aug 24 '11 at 02:49

score 1 · Accepted Answer · answered Aug 24 '11 at 05:51

I have designed systems that direct query volume in a similar scenario. It may be interesting to include several more things:

average response time for real traffic to the candidate (not just a monitoring query)
number of queries in the last time period (60s, etc)
memory/cpu/disk utilization over some previous time period

I've previously given each of the resources a weight and then basically added them up. So from a single server you might get back:

memory 50(%)
cpu 40(%)
disk 4000 (iops, if you know the limit here making it a % is good)
ms 300 (msecs average response time)

this server's weight would be 4390(higher would be worse here). You can see here where maybe if CPU is less of a concern you can change its 'weight' in the calculation to make the decision of which client to use more accurate for your environment.

How you collect this can make a difference with how frequently it can be collected, and how reliable it will be (maybe a node has died since you made the list of least used servers). One approach is to run a reporting daemon on each candidate and query it when you get a client request, maybe via multicast. The reporting daemon can collect stats very frequently to make the decision information as accurate as possible.

It isn't clear how transient the config you are generating is, which is an important consideration when doing the distribution. Will you have clients connected for long periods of time? Is it possible that you need to disconnect and re-distribute clients because a server got overloaded? Perhaps something you've already considered.

Depending on how transient your that is and how much you know about the queries you could also stand to add some more data to the decision metrics:

expected client weight currently being served by candidate (if you give clients weights too)
data set already in memory (if your data size exceeds the memory capacity of the server and you have more than a couple servers you can improve your RAM utilization by balancing queries for specific data sets to servers that already have them in memory)
uptime of the server (a completely unloaded new box will usually get crushed in weight based scenarios where the decisions are made frequently)

Hopefully that helps! It is an interesting problem.

MySQL Slave Health Check

2 Answers2