We run a vBulletin forum with 6,000 visitors per day on an SSD VPS with 5GB RAM and 24 CPU cores. On a typical day we see 30-50 concurrent users at any given moment; on busy mornings it climbs to around 130. In both cases the site is lightning fast and responsive (busy mornings have no impact on speed). We're happy with the VPS and have optimized many of our queries to keep the site running quickly.
Unfortunately, 2-3 times per week, we experience a sudden load average spike that slows the site down and makes it unusable for 3-6 minutes. Every time I open a support ticket with our host, they blame it on high traffic, but the logs show that traffic is no higher than usual. In fact, the spikes usually happen later in the day (mostly on Tuesdays and Thursdays, I've noticed), when things are much quieter. No cron jobs or vBulletin scheduled tasks run during the spikes.
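For what it's worth, this is how I've been checking the traffic claim myself: a quick per-minute request count straight from the Apache access log. A minimal sketch, and the log path is just our setup, so adjust for your own layout:

# Count requests per minute from a combined-format access log
# (the log path is an assumption; point it at your own vhost log)
awk '{print substr($4, 2, 17)}' /usr/local/apache/logs/access_log | sort | uniq -c

The per-minute counts during a spike look the same as the rest of the afternoon.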
So then they blame it on possible development issues and bad queries, but again, the MySQL processlist shows nothing unusual, and I'd expect development issues to cause consistent performance problems, not just a spike twice per week.
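I've also started snapshotting the processlist every minute so I'll have evidence on hand the next time it happens. A minimal sketch, assuming MySQL credentials are in root's ~/.my.cnf:

# Cron entry: append a timestamped processlist snapshot every minute
* * * * * (date; mysqladmin processlist) >> /root/processlist.log 2>&1

So far nothing unusual shows up there either.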
Here's what our server looks like on a regular day (40 people online right now). First, free -m:
             total       used       free     shared    buffers     cached
Mem:          5120       3408       1711          0          0       2466
-/+ buffers/cache:        942       4177
Swap:            0          0          0
And top:
top - 10:56:51 up 8 days, 1:05, 1 user, load average: 0.80, 0.86, 1.00
Tasks: 75 total, 3 running, 72 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.7%us, 0.4%sy, 0.0%ni, 94.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5242880k total, 3532408k used, 1710472k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 2538048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11717 nobody 20 0 57356 23m 6100 S 17.6 0.5 0:00.67 httpd
11609 nobody 20 0 57312 24m 6256 S 6.0 0.5 0:03.81 httpd
11672 nobody 20 0 55324 21m 6064 S 6.0 0.4 0:01.50 httpd
11689 nobody 20 0 55336 21m 6196 S 6.0 0.4 0:01.96 httpd
11675 nobody 20 0 55312 21m 6008 S 5.3 0.4 0:01.86 httpd
11708 nobody 20 0 55056 21m 5796 S 5.3 0.4 0:01.06 httpd
11624 nobody 20 0 55320 21m 6108 S 5.0 0.4 0:04.24 httpd
11669 nobody 20 0 56680 23m 6172 S 5.0 0.4 0:01.91 httpd
11688 nobody 20 0 59336 25m 6048 S 4.7 0.5 0:02.04 httpd
11704 nobody 20 0 62324 28m 5752 S 4.7 0.6 0:01.93 httpd
11674 nobody 20 0 55144 21m 5680 S 4.3 0.4 0:01.72 httpd
11715 nobody 20 0 56860 22m 5496 S 4.3 0.4 0:00.41 httpd
11489 nobody 20 0 57384 23m 6308 S 4.0 0.5 0:05.46 httpd
11492 nobody 20 0 56516 23m 6256 S 4.0 0.4 0:05.66 httpd
11631 nobody 20 0 55028 21m 5940 S 4.0 0.4 0:02.59 httpd
11645 nobody 20 0 57924 24m 6108 S 4.0 0.5 0:02.84 httpd
11666 nobody 20 0 55564 21m 5924 S 4.0 0.4 0:02.09 httpd
11633 nobody 20 0 55568 21m 5944 S 3.7 0.4 0:02.17 httpd
11691 nobody 20 0 59444 25m 6000 S 3.7 0.5 0:02.03 httpd
15004 mysql 15 -5 1260m 455m 4896 S 3.7 8.9 598:25.20 mysqld
11630 nobody 20 0 57096 23m 6272 S 3.3 0.5 0:03.72 httpd
11670 nobody 20 0 55032 20m 5844 S 3.3 0.4 0:01.57 httpd
11685 nobody 20 0 55064 21m 6028 S 3.3 0.4 0:01.89 httpd
11614 nobody 20 0 0 0 0 Z 0.3 0.0 0:03.68 httpd <defunct>
11682 nobody 20 0 55064 21m 6004 S 0.3 0.4 0:02.08 httpd
1 root 20 0 2900 980 800 S 0.0 0.0 0:00.25 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd/11679
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper/11679
134 root 16 -4 2464 624 332 S 0.0 0.0 0:00.00 udevd
568 root 20 0 37000 2088 792 S 0.0 0.0 0:03.94 rsyslogd
581 named 20 0 331m 41m 2112 S 0.0 0.8 6:17.95 named
677 root 20 0 8944 988 476 S 0.0 0.0 0:00.02 sshd
684 root 20 0 3264 744 564 S 0.0 0.0 0:00.00 xinetd
1526 root 20 0 3100 1064 828 S 0.0 0.0 0:02.67 dovecot
1529 dovenull 20 0 7092 2648 2076 S 0.0 0.1 0:00.04 pop3-login
1530 dovenull 20 0 7232 3040 2364 S 0.0 0.1 0:00.67 imap-login
1531 dovecot 20 0 2956 1012 872 S 0.0 0.0 0:01.03 anvil
1532 root 20 0 3084 1196 896 S 0.0 0.0 0:00.93 log
1535 root 20 0 3764 1800 1024 S 0.0 0.0 0:02.52 config
1537 dovenull 20 0 7244 2996 2336 S 0.0 0.1 0:00.18 pop3-login
1541 dovenull 20 0 7392 3112 2344 S 0.0 0.1 0:01.57 imap-login
1561 root 20 0 2988 520 372 S 0.0 0.0 0:00.00 atd
2152 root 20 0 12476 6792 2492 S 0.0 0.1 0:00.09 leechprotect
4916 root 20 0 13820 8048 1332 S 0.0 0.2 0:08.36 lfd - sleeping
11340 nobody 20 0 63296 29m 6352 S 0.0 0.6 0:07.12 httpd
As you can see, the load is pretty healthy: the load average hovers around 1.0 and CPU usage stays in the 1-7% range, never higher (except during the spikes).
Here is top during a spike. As you can see, the number of sleeping tasks is more than double what we usually see. Only 30 people were online at the time:
top - 17:32:38 up 2 days, 7:41, 1 user, load average: 33.61, 11.77, 5.56
Tasks: 168 total, 3 running, 163 sleeping, 0 stopped, 2 zombie
Cpu(s): 5.3%us, 0.3%sy, 0.0%ni, 94.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5242880k total, 5081012k used, 161868k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 3327124k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3838 nobody 20 0 59124 22m 5984 S 14.6 0.4 0:01.01 httpd
3861 nobody 20 0 59120 21m 5744 S 10.6 0.4 0:00.60 httpd
3799 nobody 20 0 59140 22m 6028 S 8.6 0.4 0:00.75 httpd
3840 nobody 20 0 58852 21m 5972 S 7.3 0.4 0:00.88 httpd
15004 mysql 15 -5 1247m 362m 5224 S 7.0 7.1 154:57.98 mysqld
3800 nobody 20 0 60580 23m 5988 S 6.0 0.5 0:01.23 httpd
3854 nobody 20 0 58920 21m 5840 R 6.0 0.4 0:00.68 httpd
3878 nobody 20 0 58824 21m 5852 S 5.3 0.4 0:00.91 httpd
3779 nobody 20 0 59108 22m 6004 S 4.7 0.4 0:02.62 httpd
3844 nobody 20 0 58828 21m 5816 S 4.7 0.4 0:00.65 httpd
3855 nobody 20 0 58868 21m 5872 S 4.7 0.4 0:00.63 httpd
3867 nobody 20 0 61600 24m 5968 S 4.7 0.5 0:01.07 httpd
3892 nobody 20 0 59080 21m 5620 S 4.7 0.4 0:01.00 httpd
3901 nobody 20 0 59196 22m 5760 S 4.7 0.4 0:01.07 httpd
3862 nobody 20 0 58824 21m 5600 S 4.3 0.4 0:00.97 httpd
3683 nobody 20 0 60400 23m 5940 R 4.0 0.4 0:01.50 httpd
3560 nobody 20 0 61712 24m 6140 S 3.7 0.5 0:06.92 httpd
3793 nobody 20 0 58848 21m 5988 S 3.7 0.4 0:00.74 httpd
3837 nobody 20 0 59352 21m 5816 S 3.7 0.4 0:00.54 httpd
3856 nobody 20 0 59352 21m 5632 S 3.7 0.4 0:00.94 httpd
3768 nobody 20 0 61172 23m 6000 S 3.3 0.5 0:02.02 httpd
3772 nobody 20 0 58880 21m 6020 S 3.3 0.4 0:02.57 httpd
3826 nobody 20 0 58856 21m 5568 S 3.3 0.4 0:00.75 httpd
3850 nobody 20 0 60356 23m 6156 S 3.3 0.5 0:00.96 httpd
3841 nobody 20 0 58824 21m 5676 S 3.0 0.4 0:01.15 httpd
3784 nobody 20 0 58848 21m 6008 S 0.7 0.4 0:01.61 httpd
1545 root 20 0 52776 16m 6564 S 0.3 0.3 1:54.88 httpd
3572 nobody 20 0 0 0 0 Z 0.3 0.0 0:07.76 httpd <defunct>
3807 nobody 20 0 59104 21m 5844 S 0.3 0.4 0:00.81 httpd
3820 nobody 20 0 59160 21m 5756 S 0.3 0.4 0:00.76 httpd
3868 nobody 20 0 58892 21m 5632 S 0.3 0.4 0:00.79 httpd
3884 nobody 20 0 58824 21m 5792 S 0.3 0.4 0:00.22 httpd
1 root 20 0 2900 988 808 S 0.0 0.0 0:00.10 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd/11679
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper/11679
134 root 16 -4 2464 500 260 S 0.0 0.0 0:00.00 udevd
568 root 20 0 37000 1304 776 S 0.0 0.0 0:00.87 rsyslogd
581 named 20 0 329m 39m 2164 S 0.0 0.8 1:52.02 named
677 root 20 0 8944 988 476 S 0.0 0.0 0:00.00 sshd
684 root 20 0 3264 744 564 S 0.0 0.0 0:00.00 xinetd
1526 root 20 0 3100 1064 828 S 0.0 0.0 0:00.73 dovecot
1529 dovenull 20 0 7092 2660 2088 S 0.0 0.1 0:00.02 pop3-login
1530 dovenull 20 0 7232 2956 2344 S 0.0 0.1 0:00.21 imap-login
1531 dovecot 20 0 2956 1012 872 S 0.0 0.0 0:00.27 anvil
1532 root 20 0 3084 1160 896 S 0.0 0.0 0:00.26 log
My host investigated and finally admitted that this specific instance was due to I/O problems caused by another VPS on my node, but they denied that any of the other spikes were related to that issue. It happened again two days later, and this time they blamed it on reaching Apache's MaxClients, which was technically true, but my point is that we never should have reached it in the first place. Again, there were only 30 people online (and the server had handled 130 people earlier that same morning with absolutely no problem). My suspicion is that the I/O issue causes all of these tasks to queue up and wait, so the queue just gets longer and longer, eventually hitting MaxClients and spiking the load average.
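To test that theory, I've put together a small watchdog that fires when the 1-minute load average crosses a threshold and records what's actually waiting. This is just a rough sketch; the threshold, log path, and five-minute back-off are my own choices:

#!/bin/sh
# Load watchdog: when the 1-minute load average reaches 10, log any
# processes in uninterruptible sleep (state D, i.e. blocked on I/O),
# plus a short vmstat sample, then back off for five minutes.
while true; do
    load=$(cut -d' ' -f1 /proc/loadavg | cut -d. -f1)
    if [ "$load" -ge 10 ]; then
        {
            date
            ps -eo state,pid,user,wchan:30,cmd | awk '$1 ~ /^D/'
            vmstat 1 5
        } >> /root/spike.log
        sleep 300
    else
        sleep 5
    fi
done

If I'm right, that log should fill up with httpd processes stuck in D state during the next spike.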
If this is indeed the case, I don't want to raise MaxClients to paper over a server issue that shouldn't be happening in the first place. We are never anywhere near MaxClients on a regular day. Sure, if my CPU usage usually hovered around 90%, I'd be willing to blame the spikes on traffic. But CPU usage is always in the 1-7% range, and there's never a traffic spike during a load spike. A twice-weekly spike to a load of 30+ on an otherwise idle box doesn't make any sense.
I checked sar during the I/O issue they admitted to, and I don't see anything that indicates I/O problems, so I don't know how to check for I/O issues the next time this occurs:
04:40:01 PM CPU %user %nice %system %iowait %steal %idle
04:50:01 PM all 8.60 0.00 0.55 0.00 0.36 90.48
05:00:01 PM all 8.18 0.00 0.54 0.02 0.20 91.06
05:10:01 PM all 8.44 0.14 0.55 0.05 0.17 90.65
05:20:01 PM all 8.28 0.00 0.53 0.01 0.15 91.02
05:30:01 PM all 7.02 0.00 0.47 0.00 0.11 92.40
At the time of my top snapshot (17:32 / 5:32 PM), with a load average above 30, iowait was 0.00%. How did they determine this to be an I/O issue? And how can I determine it on my own in the future?
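Next time, my plan is to watch per-device numbers rather than the CPU-level %iowait, something like the following (assuming the sysstat package, which provides sar and iostat, is installed):

# Per-device stats every 5 seconds: high await/%util alongside modest
# r/s and w/s would point at a contended or saturated disk
iostat -x 5 >> /root/iostat.log

# Or pull the per-device history for the spike window after the fact,
# if the sysstat data collector is running:
sar -d -s 17:25:00 -e 17:40:00

Though I'm not even sure a container-based VPS exposes the host's disks to these tools, which might be why sar shows nothing.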
So, in conclusion, I have two questions:
1. Do these symptoms point to any specific issue that would explain what's going on?
2. Can you suggest any monitoring tools that would help me pinpoint the issue and prove the root cause?
Thanks!