
I have a Drupal application which has been running on a single MySQL database server for 12 months, and has been performing relatively well (apart from peak load events). We needed to be able to support much higher spikes than the current DB server allowed, and at 32GB there was not much gain to be had from simply vertically scaling the single DB server.

We decided to set up a new MariaDB Galera cluster with 2x 32GB instances. We matched the configuration as closely as possible with the soon-to-be-obsolete DB server.

After migrating to the new database servers, we noticed that the CPU usage on those instances was constantly at 100%, and load was steadily increasing. Over the course of 1 hour, load average went from 0.1 to 150.

Initially we thought it might have something to do with the synchronisation between servers, but even with one server turned off and no sync occurring, it was still maxing out the CPU as long as the web application was making requests to it.

After a lot of experimentation I found that reducing a few of the configuration options had a profound effect on the CPU usage and load. After making the changes below, the load average has stabilised between 4 and 6 on both instances.

[Graph: CPU utilisation & load average]

The questions

  • What are some possible reasons for such a dramatic difference in CPU usage between the old and new servers, despite essentially migrating the configuration from the old server?
  • Load is currently hovering between 4 and 6 (and this is a low-traffic period for our website). What should I be looking at to try to reduce this value, and to ensure that when the site gets hit with some real traffic it won't fall over?

Config changes

innodb_buffer_pool_instances

  • Original value: 500 (there are 498 tables total in all databases)
  • New value: 92

table_cache

  • Original value: 8
  • New value: 4

max_connections

  • Original value: 1000
  • New value: 400

Current configuration

Here is the full configuration file from one of the servers, /etc/mysql/my.cnf:

[client]
port    = 3306
socket    = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket    = /var/run/mysqld/mysqld.sock
nice    = 0

[mysqld]

binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
query_cache_type=1
bind-address=0.0.0.0

max_connections = 400
wait_timeout = 600
key_buffer_size    =  16M
max_allowed_packet  = 16777216
max_heap_table_size = 512M
table_cache = 92 
thread_stack    = 196608
thread_cache_size       = 8
myisam-recover         = BACKUP
query_cache_limit = 1048576
query_cache_size        = 128M
expire_logs_days  = 10
general_log = 0
max_binlog_size         = 10485760
server-id = 0
innodb_file_per_table
innodb_buffer_pool_size = 25G
innodb_buffer_pool_instances = 4
innodb_log_buffer_size = 8388608
innodb_additional_mem_pool_size = 8388608
innodb_thread_concurrency = 16
net_buffer_length = 16384
sort_buffer_size = 2097152
myisam_sort_buffer_size = 8388608
read_buffer_size = 131072
join_buffer_size = 131072
read_rnd_buffer_size = 262144
tmp_table_size = 512M

long_query_time = 1
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log

# Galera Provider Configuration
wsrep_provider=/usr/lib/galera/libgalera_smm.so
#wsrep_provider_options="gcache.size=32G"

# Galera Cluster Configuration
wsrep_cluster_name="xxx"
wsrep_cluster_address="gcomm://xxx.xxx.xxx.107,xxx.xxx.xxx.108"

# Galera Synchronization Configuration
wsrep_sst_method=rsync
#wsrep_sst_auth=user:pass

# Galera Node Configuration
wsrep_node_address="xxx.xxx.xxx.107"
wsrep_node_name="xxx01"


[mysqldump]
quick
quote-names
max_allowed_packet  = 16777216

[isamchk]
key_buffer_size    = 16777216
nicksanta

2 Answers


We ended up getting a Percona consultant to assist with this problem. The main issue they identified was that a large number of EXPLAIN queries were being executed. It turned out this was debugging code that had been left enabled (the devel module's query logging, for Drupal devs). Disabling it saw CPU usage fall off a cliff.

Guess what time we disabled the EXPLAIN queries?
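For anyone cleaning up the same thing: the sketch below assumes a Drupal 7 site with drush available, and that the devel module's devel_query_display variable is what controls the query log on your site (check your own devel settings before relying on it).

# Hypothetical clean-up, assuming Drupal 7 + drush and that devel's
# query logging is the source of the extra EXPLAIN queries
drush vset devel_query_display 0
drush cc all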

There were a number of additional fixes which they recommended we implement.

  • Add a third node to the cluster to act as an observer and maintain the integrity of the cluster.
  • Add primary keys to tables that do not have one.
  • Change MyISAM tables to InnoDB.
  • Change wsrep_sst_method from rsync to xtrabackup-v2.
  • Set innodb_log_file_size to 512M.
  • Set innodb_flush_log_at_trx_commit to 2, as the cluster maintains the integrity of the data (see the config sketch below this list).
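For reference, here is a sketch of what the last three recommendations might look like in my.cnf. The values are the ones suggested to us, but verify them against your own workload, and note that the SST credentials are placeholders:

# SST via Percona XtraBackup instead of rsync (needs xtrabackup installed
# and an SST user on every node; credentials below are placeholders)
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = "sst_user:sst_password"

# Bigger redo log and relaxed flushing per commit; the cluster itself
# provides durability across nodes, which is why trx_commit=2 is acceptable
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2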

I hope this information helps anyone who runs into similar issues.

nicksanta

innodb_buffer_pool_instances should not be a function of the number of tables. The manual advocates that each instance be no smaller than 1GB. So, I suggest that even 92 is much too high. But my.cnf says only innodb_buffer_pool_instances = 4??
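To put numbers on it, using the 25G buffer pool from the posted my.cnf:

innodb_buffer_pool_size = 25G
innodb_buffer_pool_instances = 4     # 25G / 4  ~ 6.25G per instance -- fine
# innodb_buffer_pool_instances = 92  # 25G / 92 ~ 280M per instance -- well under the 1GB guideline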

table_cache = 92

Maybe your comments are messed up? 500 would be more reasonable for table_open_cache. (table_cache is the old name.)
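One way to sanity-check the setting is to compare it against what the server actually has open; a sketch (run on either node):

-- How many tables are open vs. how many have had to be (re)opened;
-- a steadily climbing Opened_tables suggests the cache is too small
SHOW GLOBAL STATUS LIKE 'Open_tables';
SHOW GLOBAL STATUS LIKE 'Opened_tables';
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';

-- Raise it at runtime, then persist the same value in my.cnf
SET GLOBAL table_open_cache = 500;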

This may be the problem:

query_cache_size = 128M

Whenever a write occurs, all entries in the QC for the table(s) involved are purged from the QC. Recommend no more than 50M. Or, better yet, turn the QC off completely.
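If you go the "off completely" route, the my.cnf equivalent is simply:

query_cache_type = 0
query_cache_size = 0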

You have the slowlog turned on. What does pt-query-digest say are the top couple of queries? (This may be your best way to get a handle on the problem.)
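A minimal run against the slow log already configured above, assuming percona-toolkit is installed:

# Summarise the slow log; the report ranks queries by total time consumed
pt-query-digest /var/log/mysql/mysql-slow.log > /tmp/slow-digest.txt
less /tmp/slow-digest.txt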

Rick James
  • Yes you are correct on both these points. The percona engineer also recommended we disable QC, and table_open_cache is the correct name of that config. – nicksanta Mar 14 '15 at 06:08