Apache 2.4.9 100% CPU load of random workers

Question

I own a t1.micro EC2 instance. After release upgrade of Ubuntu to saucy I got Apache 2.4.6 and I started noticing 100% CPU load like this

top - 19:37:58 up  2:55,  2 users,  load average: 3.90, 2.90, 1.82
Tasks:  95 total,   4 running,  91 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.9 us,  7.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 88.9 st
KiB Mem:    629976 total,   588412 used,    41564 free,    39412 buffers
KiB Swap:  2097144 total,        0 used,  2097144 free,   326932 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6688 www-data  20   0  114m  17m  12m R  34.3  2.9   3:07.74 apache2
 6719 www-data  20   0  113m  10m 6052 R  33.3  1.8   3:16.99 apache2
 6721 www-data  20   0  113m 9.8m 5004 R  32.0  1.6   3:04.13 apache2

I decided to install another Apache version from this ppa, so I got 2.4.9 now, but the problem is still there.

What I tried: 1. removed all modules and added only necessary ones. At the moment the loaded ones are:

lrwxrwxrwx  1 root root   36 Apr  9 19:16 access_compat.load -> ../mods-available/access_compat.load
lrwxrwxrwx  1 root root   28 Apr  9 19:17 alias.conf -> ../mods-available/alias.conf
lrwxrwxrwx  1 root root   28 Apr  9 19:17 alias.load -> ../mods-available/alias.load
lrwxrwxrwx  1 root root   33 Apr  9 19:18 authz_core.load -> ../mods-available/authz_core.load
lrwxrwxrwx  1 root root   38 Apr  9 19:18 authz_groupfile.load -> ../mods-available/authz_groupfile.load
lrwxrwxrwx  1 root root   33 Apr  9 19:18 authz_host.load -> ../mods-available/authz_host.load
lrwxrwxrwx  1 root root   33 Apr  9 19:18 authz_user.load -> ../mods-available/authz_user.load
lrwxrwxrwx  1 root root   26 Apr  9 19:18 dir.conf -> ../mods-available/dir.conf
lrwxrwxrwx  1 root root   26 Apr  9 19:18 dir.load -> ../mods-available/dir.load
lrwxrwxrwx  1 root root   27 Apr  9 19:19 mime.conf -> ../mods-available/mime.conf
lrwxrwxrwx  1 root root   27 Apr  9 19:19 mime.load -> ../mods-available/mime.load
lrwxrwxrwx  1 root root   34 Apr  9 19:11 mpm_prefork.conf -> ../mods-available/mpm_prefork.conf
lrwxrwxrwx  1 root root   34 Apr  9 19:11 mpm_prefork.load -> ../mods-available/mpm_prefork.load
lrwxrwxrwx  1 root root   27 Apr  9 14:39 php5.conf -> ../mods-available/php5.conf
lrwxrwxrwx  1 root root   27 Apr  9 14:39 php5.load -> ../mods-available/php5.load
lrwxrwxrwx  1 root root   30 Apr  9 19:19 rewrite.load -> ../mods-available/rewrite.load
lrwxrwxrwx  1 root root   29 Apr  9 19:20 status.conf -> ../mods-available/status.conf
lrwxrwxrwx  1 root root   29 Apr  9 19:20 status.load -> ../mods-available/status.load

reinstalling Apache to a newer version (see above, I'm on 2.4.9 now and still it is there)
stopping and starting the instance (hoping it would get onto another hardware) - no luck
The site I'm testing with is WP-based with W3TC plugin installed, I disabled minification of W3TC static files. I decide to do so because I noticed that it stucks sometimes on minified files in status module.
upgrading WP to the latest version (3.8.2) - nothing changed
Looking now at status module output, I'm seeing that there is no pattern as to what requests are stuck - absolute random. For example: http://pastebin.com/JxLbbzCB - see that POST /wp-admin/admin-ajax.php has been stuck for 62 seconds in W (sending reply), 5-0 request is also in W status for 250 seconds.

All these "W" workers load CPU by 100% and last really long causing EC2 instance to start CPU throttle.

UPDATE: here is strace info for one of hanged processes:

select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999997})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})
send(16, "", 0, MSG_NOSIGNAL)           = 0
select(17, [16], [16], NULL, {1, 0})    = 1 (out [16], left {0, 999998})

and it lasts while the process is not killed

UPDATE: output of ps auxf|grep apache2 :

root@domU-12-31-39-02-26-E9:~# ps auxf|grep apache2
root      2761  0.0  0.1   4168   852 pts/1    S+   19:52   0:00                      \_ grep --color=auto apache2
root      2549  0.0  2.3 115720 15104 ?        Ss   19:50   0:00 /usr/sbin/apache2 -k start
www-data  2554  0.0  0.8 115800  5660 ?        S    19:50   0:00  \_ /usr/sbin/apache2 -k start
www-data  2555  0.2  3.6 117944 22872 ?        S    19:50   0:00  \_ /usr/sbin/apache2 -k start
www-data  2556  0.2  3.7 119252 23360 ?        S    19:50   0:00  \_ /usr/sbin/apache2 -k start
www-data  2557  9.9  1.2 115932  8068 ?        R    19:50   0:14  \_ /usr/sbin/apache2 -k start
www-data  2558  9.7  1.2 115932  8068 ?        R    19:50   0:14  \_ /usr/sbin/apache2 -k start
www-data  2562  0.0  0.8 115800  5660 ?        S    19:50   0:00  \_ /usr/sbin/apache2 -k start
www-data  2564  0.0  0.8 115800  5656 ?        S    19:50   0:00  \_ /usr/sbin/apache2 -k start
www-data  2566 20.7  1.2 115932  8044 ?        R    19:50   0:28  \_ /usr/sbin/apache2 -k start
www-data  2567 35.9  1.2 115932  8072 ?        R    19:50   0:49  \_ /usr/sbin/apache2 -k start
www-data  2568 10.8  1.2 115932  8080 ?        R    19:50   0:14  \_ /usr/sbin/apache2 -k start
www-data  2571  0.0  0.8 115800  5644 ?        S    19:51   0:00  \_ /usr/sbin/apache2 -k start
www-data  2572  0.0  0.8 115800  5644 ?        S    19:51   0:00  \_ /usr/sbin/apache2 -k start
www-data  2573  0.0  0.8 115800  5644 ?        S    19:51   0:00  \_ /usr/sbin/apache2 -k start
www-data  2574  0.0  0.7 115752  4900 ?        S    19:51   0:00  \_ /usr/sbin/apache2 -k start
root@domU-12-31-39-02-26-E9:~#

Output of tail /var/log/apache2/error.log (error logs configured for other vhosts are empty at the moment):

[Mon Apr 21 19:50:22.201343 2014] [:notice] [pid 2552] FastCGI: process manager initialized (pid 2552)
[Mon Apr 21 19:50:22.692477 2014] [mpm_prefork:notice] [pid 2549] AH00163: Apache/2.4.9 (Ubuntu) mod_fastcgi/mod_fastcgi-SNAP-0910052141 PHP/5.5.3-1ubuntu2.3 configured -- resuming normal operations
[Mon Apr 21 19:50:22.692580 2014] [core:notice] [pid 2549] AH00094: Command line: '/usr/sbin/apache2'

free -m

             total       used       free     shared    buffers     cached
Mem:           615        596         18          0        217        142
-/+ buffers/cache:        236        378
Swap:         2047         22       2025

df -h

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        30G   22G  6.1G  79% /
devtmpfs        299M   12K  299M   1% /dev
none             62M  208K   62M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            308M     0  308M   0% /run/shm
none            100M     0  100M   0% /run/user

You're on a micro instance. You have 88% steal time. That's even higher than usual. It's not surprising that you're having issues. — Michael Hampton, Apr 09 '14 at 19:54
yeah check this http://blog.scoutapp.com/articles/2013/07/25/understanding-cpu-steal-time-when-should-you-be-worried — Petter H, Apr 09 '14 at 19:56
@Michael: the steal appears only after these hanged workers have been taking 100% cpu for decent time. Initially, if apache is not running it's 0. This is CPU throttling. So this steal time is the consequence of what I'm asking here. See this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#when-instance-uses-allotted-resources — Alexey, Apr 10 '14 at 08:31
[select](http://en.wikipedia.org/wiki/Select_%28Unix%29) system call check status (ready to read or write) of some file descriptors (FD), in your case single FD with id `16`. Use `lsof` to see what FD 16 referred to. I think it some socket, because next call is `send` empty string in that FD, after that FD becomes ready to read and so on. — user1516873, Apr 12 '14 at 11:17
@Alexey is Apache configured with KeepAlive? if so try turning it off. — LinuxDevOps, Apr 17 '14 at 01:30
That `strace` shows a `send` call for zero bytes, which makes no sense. It makes even less sense after a read hit. Something seems very wrong with the logic of whatever process you were tracing. — David Schwartz, Apr 17 '14 at 03:55
@Alexey output of `ps auxf|grep apache2` , `tail /var/log/apache2/error.log` (or equivalent) , `free -m` and `df -h` ? — LinuxDevOps, Apr 17 '14 at 16:09
@LinuxDevOps: see I updated my question, these commands have been run when there was 100% cpu load and 90 steal. — Alexey, Apr 21 '14 at 20:00
@LinuxDevOps: I had also tried setting off of KeepAlive - doesn't help — Alexey, Apr 21 '14 at 20:00
@Alexey hmm, try unloading the status module (probably not being used), also do a `lsof pid` where pid is the pid of the apache processes with high CPU (like 2567 in your snippet) — LinuxDevOps, Apr 21 '14 at 20:29
@LinuxDevOps: sorry for the delay, but I updated today the whole distribution to 14.04 and now it's fine. Which make me think it was something like partial or failed upgrade of certain system packages, not the apache itself? Just to update everyone else: yes, apache 2.4 works fine on t1.micro with our site visits figures — Alexey, Apr 22 '14 at 22:28
@Alexey glad it's working now, I was going to suggest just trying it out in another cheap server/provider for a day or so (costs cents plus your time) and/or apt-get remove --purge ; apt-get install . I've seen before inexplicable "runaway" apache processes and I don't like the blanket advice of "buy bigger" when the problem causes haven't been identified. — LinuxDevOps, Apr 23 '14 at 01:37

score 3 · Answer 1 · answered Apr 16 '14 at 04:34

3

Heavy usage Apache 2.4 on a t1.mirco instance is like humpty dumpty sitting on a couple toothpicks. He's too big and heavy! So of course he's squashed the toothpicks to pieces and rolled down the hill.

You have at least two basic solutions to the root cause of this problem:

Get more server resources, and less steal time. Either with AWS, or go to Digital Ocean for value.
Stop using Apache 2 as the client facing daemon. Either migrate to something more lightweight like Lighttpd, Nginx, or implement a caching proxy layer like Varnish.

answered Apr 16 '14 at 04:34

taddy hoops

391
1
8

Thanks for your answer, but the problem is that the version of apache that was in 12.10 or 12.04 was absolute fine on this very instance. It only started with 2.4 after upgrading to 13.10 (because of Heartbleed). I'm also aware of varnish caching server or I can use nginx, but I don't feel like doing this knowing that 2.2 was fine on the same hardware + feeling that there is certain issue with the hanged processes. I will never believe that a worker process taking 100% cpu serving just a _static_ html page should take 100% cpu for 10+ minutes on t1.micro and this is "normal" – Alexey Apr 21 '14 at 20:04

score -2 · Answer 2 · answered Apr 17 '14 at 00:58

-2

Most likely the problem is memory shortage for apache. Apache doesn't have enough memory to to complete properly and thus get into unpredictable state resulting in the CPU loop.

I recommend to reduce the memory footprint for your VM (or move to a larger VM). Find the right setting for MaxClients by experimenting. Start with minimum setting and if this fixes CPU issue, increase MaxClients and MaxSpareServers until you find the balance.

Start with this (assuming you use prefork) and see if the issue persists:

ListenBacklog 10

<IfModule prefork.c>
StartServers         1
MinSpareServers      1
MaxSpareServers      1
MaxClients           1
MaxRequestsPerChild  1000
</IfModule>

answered Apr 17 '14 at 00:58

Alec Istomin

322
1
8

This seems unlikely given the amount of free physical memory he has and the fact that none of his swap is used. – David Schwartz Apr 17 '14 at 03:56
There is no history of memory utilization, shortage could have triggered cpu spike before the top snapshot – Alec Istomin Apr 17 '14 at 05:12
And then every single byte that was swapped out got swapped back in? That's totally implausible. To say it's "most likely" is completely unjustified. – David Schwartz Apr 17 '14 at 05:26
Not all memory can be swapped, plus uve seen similar cases, although with openvz, not xen used at AWS – Alec Istomin Apr 17 '14 at 05:36

Apache 2.4.9 100% CPU load of random workers

2 Answers2