We have a CentOS 6.9 server with a 4-core CPU and 32GB RAM.
Every day at around the same time, the load average gradually climbs from 0.0 up to 11, at which point we finally have to restart the server.
CPU
: Spikes sometimes, when processes such as spamd or fail2ban take a few seconds to do something. Otherwise it is around 1%.
For a few moments php also takes more than 50% CPU.
It is at least 70% idle most of the time.
I/O
: There isn't much I/O activity, but sometimes mysqld takes 99.5% of the I/O.
RAM
: Always within an acceptable range.
Bandwidth
: Does not change much during that time. Within an acceptable range.
Disk Space
: More than 1TB free space.
AV
: Ran clamd and other tools and found some infected files in the WordPress installation. They are now removed.
Even when all of these are low, the load average keeps increasing and the server becomes too slow to do anything, so we are forced to restart. Then things return to normal.
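To see exactly when the climb begins, and whether it tracks blocked tasks rather than CPU use, the load average could be logged next to the kernel's own task counts. A rough sketch, assuming the standard /proc/loadavg and /proc/stat files (with their procs_running and procs_blocked lines); the function name sample_load and the log path in the usage comment are hypothetical:

```shell
#!/bin/sh
# sample_load: print one line with a timestamp, the three load averages,
# and the kernel's counts of runnable and blocked (state D) tasks.
# Both file arguments default to the real /proc files; they are
# parameters only so the function can be tried on saved copies.
sample_load() {
    loadavg_file=${1:-/proc/loadavg}
    stat_file=${2:-/proc/stat}

    # /proc/loadavg: 1-min, 5-min, 15-min averages, then other fields.
    read -r one five fifteen rest < "$loadavg_file"

    # /proc/stat carries system-wide task counters on their own lines.
    running=$(awk '$1 == "procs_running" {print $2}' "$stat_file")
    blocked=$(awk '$1 == "procs_blocked" {print $2}' "$stat_file")

    printf '%s load=%s,%s,%s running=%s blocked=%s\n' \
        "$(date '+%F %T')" "$one" "$five" "$fifteen" "$running" "$blocked"
}

# Hypothetical usage, one sample per minute without relying on cron:
#   while sleep 60; do sample_load >> /root/load-samples.log; done
```

A rising blocked count with low running would suggest tasks waiting on something other than the CPU, since on Linux the load average includes tasks in uninterruptible sleep.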
This is not a cronjob issue. Because it happens on such a regular schedule, I also thought it must be cron, so I ran service crond stop and kept it stopped for a few hours before the lag usually begins. The lag still happens.
There are many processes running during the high load average time. Some are: multiple mysqld, ../bin/suexec 501 501 php5, /bin/php, and /fail2ban processes.
I also get many emails from the server stating System Load Alert 1 for mysite.com.
As it happens regularly at the same time, a hardware issue doesn't seem to be the cause.
My question is, what else could I check to resolve this? I was betting all my chips on the cronjob.
Update 1:
Swap Space: Checked using sar -W -f /var/log/sa/sa15; all the values are 0.
Running free -h gives:
             total       used       free     shared    buffers     cached
Mem:           16G       2.9G        13G        18M         0B       1.4G
-/+ buffers/cache:       1.4G        14G
Swap:           0B         0B         0B
So it appears there isn't any swap space, but with this amount of free RAM, I doubt we really need it.
Update 2:
Checked with iotop: the read/write speed went up to 250 KB/s for a few seconds, and that is it.
Accumulated I/O during the slowness, using iotop -aoP:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1508 be/4 root 7.88 M 24.00 K 0.00 % 1.32 % perl -T -w /usr/local/cpanel/3rdparty/bin/spamd --max-spare=1 --~llowed-ips=127.0.0.1,::1 --pidfile=/var/run/spamd.pid --listen=5
9288 be/4 root 248.00 K 16.00 K 0.00 % 2.42 % tailwatchd - chkservd - spamd check
1714 be/4 root 0.00 B 4.00 K 0.00 % 0.24 % queueprocd - wait to process a task
5610 be/4 root 8.00 K 152.00 K 0.00 % 0.03 % tailwatchd
1446 be/4 mysql 8.79 M 256.00 K 0.00 % 0.02 % mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr~0 --pid-file=/var/lib/mysql/..
The result of find /proc/*/task/. -name stat -exec grep ' D ' {} \; varies a lot: sometimes there is nothing, sometimes different processes. I will post the ones that show up when the load average is really high, around 18:
1585 (fail2ban-server) D 1 1575 1575 0 -1
16943 (cpsrvd (SSL) - ) D 1 16943 16942 0 -1
17221 (tailwatchd) D 1 17220 17220 0 -1
17255 (nginx) D 17253 17253 17253 0 -1
18102 (mysqld) D 17491 17479 12176 3482 0 -1
18355 (mysqld) D 17491 17479 12176 3482 0 -1
18099 (httpd) D 18087 18087 18087 0 -1
18127 (httpd) D 18087 18087 18087 0 -1
18312 (exim) D 1 17295 17295 0 -1
18375 (php) D 18096 18087 18087 0 -1
18379 (exim) D 18368 18368 18368 0 -1
18408 (find) D 18144 18107 18107 0 -1
18410 (suexec) D 18095 18087 18087 0 -1
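Because the output varies between runs, a per-command tally may be easier to compare across samples than the raw lines. A minimal sketch, assuming the stat lines have the same "pid (name) state ..." layout as the ones above; the function name d_state_summary is made up for illustration:

```shell
#!/bin/sh
# d_state_summary: read /proc/<pid>/stat style lines on stdin and print
# how many tasks per command name are in uninterruptible sleep (state D),
# most frequent first.
# The name is taken as everything between the first "(" and the last ")"
# that is followed by " D ", so names such as "cpsrvd (SSL) - " stay intact.
d_state_summary() {
    sed -n 's/^[0-9][0-9]* (\(.*\)) D .*$/\1/p' | sort | uniq -c | sort -rn
}

# Hypothetical usage against the live system (tasks may exit while being
# read, hence the discarded errors):
#   cat /proc/[0-9]*/task/[0-9]*/stat 2>/dev/null | d_state_summary
```

Running this every few seconds while the load is high would show whether the same commands keep landing in state D or whether the set keeps changing.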