
We have a CentOS 6.9 server with a 4-core CPU and 32GB RAM.

Every day at around the same time, the load average gradually climbs from 0.0 up to 11, at which point we finally have to restart the server.

CPU: Spikes occasionally when processes such as spamd or fail2ban take a few seconds to do something; otherwise it sits around 1%. PHP also briefly takes >50% CPU at times. The CPU is at least 70% idle most of the time.

I/O: There isn't much I/O activity, but sometimes mysqld accounts for 99.5% of it.

RAM: Always within an acceptable range.

Bandwidth: Doesn't change much during that time. Acceptable range.

Disk Space: More than 1TB free space.

AV: Ran clamd and other tools; they found some infected files in the WordPress installation, which have now been removed.

Even when all these metrics are low, the load average keeps increasing until the server becomes too slow to do anything and we are forced to restart. Then things return to normal.

This is not a cron job issue. Because of the regular timing I too suspected cron, so I ran `service crond stop` and kept it stopped for a few hours before the lag usually begins. The lag still happened.
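
With cron ruled out, the next step is catching the culprit in the act. A minimal sketch (the log path is an assumption): append one timestamped snapshot of the load average and the top CPU consumers, and run it repeatedly — e.g. every minute from a second shell — through the period when the load ramps up:

```shell
# Append one timestamped snapshot of load average and top CPU users.
# /tmp/loadtrace.log is an assumed path -- use any writable location.
{
    date
    cat /proc/loadavg                                   # 1/5/15-min load
    ps -eo pid,stat,pcpu,pmem,comm --sort=-pcpu | head -n 10
    echo
} >> /tmp/loadtrace.log
```

Comparing snapshots from before and during the slowdown shows whether the load comes from CPU hogs or from processes stuck waiting on I/O (state `D` in the `STAT` column).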

There are many processes running during the high-load period. Some are: multiple `mysqld` threads, `../bin/suexec 501 501 php5`, `/bin/php`, and `fail2ban` processes.

I also get many emails from the server stating `System Load Alert 1 for mysite.com`.

As it happens regularly at the same time, a hardware issue doesn't seem likely.

My question is: what else can I check to resolve this? I had bet all my chips on the cron job.

Update 1: Swap space: checked using `sar -W -f /var/log/sa/sa15`; all the values are 0. `free -h` gives:

             total       used       free     shared    buffers     cached
Mem:           16G       2.9G        13G        18M         0B       1.4G
-/+ buffers/cache:       1.4G        14G
Swap:           0B         0B         0B

So there is no swap space at all, but with this much free RAM, I doubt we really need it.
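
One thing a single reading of `free` can't show is whether something briefly consumes a lot of memory and releases it again before anyone looks. A small sketch (log path is an assumption) that records a timestamped memory sample; repeating it each minute builds a memory-over-time picture:

```shell
# One timestamped memory sample from /proc/meminfo (values in kB).
# /tmp/memtrend.log is an assumed path.
{
    date
    grep -E '^(MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
} >> /tmp/memtrend.log
```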

Update 2: Checked read/write speed with `iotop`; it went up to 250 KB/s for a few seconds, that's it.

Result of I/O activity during the slowness, using `iotop -aoP`:

 Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 1508 be/4 root          7.88 M     24.00 K  0.00 %  1.32 % perl -T -w /usr/local/cpanel/3rdparty/bin/spamd --max-spare=1 --~llowed-ips=127.0.0.1,::1 --pidfile=/var/run/spamd.pid --listen=5
 9288 be/4 root        248.00 K     16.00 K  0.00 %  2.42 % tailwatchd - chkservd - spamd check
 1714 be/4 root          0.00 B      4.00 K  0.00 %  0.24 % queueprocd - wait to process a task
 5610 be/4 root          8.00 K    152.00 K  0.00 %  0.03 % tailwatchd
 1446 be/4 mysql         8.79 M    256.00 K  0.00 %  0.02 % mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr~0 --pid-file=/var/lib/mysql/..
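
`iotop` has to be watched live, but cumulative per-process I/O counters can also be read straight from `/proc/<pid>/io` (root is needed for other users' processes, and the kernel must have I/O accounting enabled). A sketch ranking processes by total bytes written:

```shell
# Rank processes by cumulative write_bytes from /proc/<pid>/io.
# 2>/dev/null hides races where a process exits mid-scan.
for p in /proc/[0-9]*; do
    [ -r "$p/io" ] || continue
    wb=$(awk '/^write_bytes/ {print $2}' "$p/io" 2>/dev/null)
    echo "${wb:-0} ${p#/proc/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -n 10
```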

The result of `find /proc/*/task/. -name stat -exec grep ' D ' {} \;` varies a lot: sometimes there is nothing, sometimes different processes appear. Below are the ones present when the load average is really high, around 18:

1585 (fail2ban-server) D 1 1575 1575 0 -1
16943 (cpsrvd (SSL) - ) D 1 16943 16942 0 -1
17221 (tailwatchd) D 1 17220 17220 0 -1
17255 (nginx) D 17253 17253 17253 0 -1 
18102 (mysqld) D 17491 17479 12176 3482 0 -1
18355 (mysqld) D 17491 17479 12176 3482 0 -1
18099 (httpd) D 18087 18087 18087 0 -1 
18127 (httpd) D 18087 18087 18087 0 -1 
18312 (exim) D 1 17295 17295 0 -1
18375 (php) D 18096 18087 18087 0 -1
18379 (exim) D 18368 18368 18368 0 -1
18408 (find) D 18144 18107 18107 0 -1
18410 (suexec) D 18095 18087 18087 0 -1
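
The same D-state check can be done with a single `ps` invocation, which also shows the kernel function each task is blocked in (`WCHAN`) — that usually points at the filesystem or driver responsible:

```shell
# List tasks in uninterruptible sleep (state D, almost always disk I/O),
# with the kernel wait channel they are blocked in:
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```
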
    Probably I/O is a problem. What does `grep ' D ' /proc/*/task/*/stat` say when the load is high? And do you have enough free swap space? – kasperd Jul 16 '17 at 17:26
  • Are you using nginx + php-fpm or apache + mod_php5? There's a big difference in performance tuning. Also, can you please paste the non-secret portions of your mysql config? – 2ps Jul 16 '17 at 17:41
  • @kasperd: Running that command gives an error saying 'stat' file doesn't exist, as some folders don't seem to have it. Will run `find . -name stat -exec grep ' D ' {} \;` and update the post. Thanks! – NewServerGuy Jul 17 '17 at 09:07
  • @2ps: It is Apache + FastCGI PHP 5.6.3. We have nginx installed and active, but our website uses Apache. – NewServerGuy Jul 17 '17 at 09:10
    AV: Ran clamd and other tools, found some in the wordpress installation. -> Server compromised, nuke and reinstall. – TomTom Jul 17 '17 at 09:51
  • @TomTom: We are on an unmanaged server, would be easier just to move it. Would just moving it and transferring site files and database help? – NewServerGuy Jul 17 '17 at 09:56
  • No idea. Just telling you that if you found a virus in a non-file-storage area the server is compromised. Which means you MAY have malware on it that you are not aware of. – TomTom Jul 17 '17 at 10:02
  • @TomTom: Okay, will check it out. But isn't wordpress installation area a file storage area? – NewServerGuy Jul 17 '17 at 10:08
  • @NewServerGuy A wordpress INSTALLATION is RUNNING on the server. – TomTom Jul 17 '17 at 11:04
  • Is there a category to tag this as a "whack-a-mole" question? – Daniel Ferradal Jul 17 '17 at 11:07
  • @TomTom: OK, but then what is the file storage area of a server? What do you store there? I assumed you meant system files vs user files, so user files are file storage area. Sorry if I misunderstood. – NewServerGuy Jul 17 '17 at 11:11
  • That is stuff that is not executed. If your server is an FTP server and never executes the stuff, or an SMB/NFS server for workstations. – TomTom Jul 17 '17 at 11:14
  • @TomTom: Ah! Okay. Couldn't +1, not enough karma. I updated the post with a few more results if you'd like to take a look. I will see if we can move things. Thanks! – NewServerGuy Jul 17 '17 at 11:19
  • @NewServerGuy It is normal for `grep ' D ' /proc/*/task/*/stat` to output a single error message. What's interesting is the rest of the output. The large amount of free memory can mean one of two things. Either you really have more memory than you need, or something has recently been using a lot of memory and then freed it up by the time you looked. One way one could have told the difference between the two is from whether something has been swapped out. But since the machine doesn't have a swap partition we cannot deduce anything that way. – kasperd Jul 17 '17 at 21:44
  • @NewServerGuy If you do have more memory than you need the kernel would be caching everything it ever read from disk or wrote. So you should see a growing number in `cached` to the point where every file you are using resides in cache. At that point the system will no longer need to do any reads and I/O will entirely be from writes being flushed to disk. But the number in `cached` seems a bit lower than I would expect in that scenario. And it also looks like you had a `find` command waiting on I/O. Unless the output was redirected to a file, I wouldn't expect `find` to do any writes. – kasperd Jul 17 '17 at 21:49
  • @kasperd: What number for the `cached` should be the red zone. Over 20% i.e. 3.2GB? I will check it again when it lags. Right now what I am doing is, restarting the server at a random time, to check if it lags 24hrs from previous restart or if it lags at a certain time of the day. Thanks for your help. – NewServerGuy Jul 18 '17 at 11:31
  • It is not like there is a right or wrong value. If all the data the server ever needed to touch does in fact fit in just 1.4GB it is perfectly fine for the cache size to be no larger than that. All we have determined so far is that the machine was in fact busy doing I/O, but we haven't determined why. If we knew what the memory usage was over time that might tell us something. – kasperd Jul 18 '17 at 20:51

1 Answer


In my case, the server had contained malicious files before, and a cached copy of them may have remained.

Clearing the website and server caches solved the issue.
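
For reference, a sketch of the kind of cleanup involved — the document root below is a placeholder, not the real path:

```shell
# DOCROOT is a placeholder; point it at the real WordPress document root.
DOCROOT=${DOCROOT:-/home/USER/public_html}
rm -rf "$DOCROOT"/wp-content/cache/*   # file-based WordPress plugin caches
# On CentOS 6, restarting Apache also drops any PHP opcode cache:
# service httpd restart
```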
