NGINX/PHP-FPM cpu spikes and runaway processes

Question

Recently I switched to NGINX/PHP-FPM to run my forums.

The majority of the time, the site runs beautifully, seriously quick and I'm really happy with it. Its on a 13 core/30+GB memory instance with AWS, so ample resources (was on 8 core, 16GB before with Apache.)

So, what happens is, the majority of the time we have 6 or 7 PHP-FPM processes and all is well with the world;

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27676 apache    20   0  499m  34m  19m R 49.2  0.1   0:06.33 php-fpm
27669 apache    20   0  508m  48m  24m R 48.2  0.1   0:10.84 php-fpm
27661 apache    20   0  534m  75m  26m R 45.9  0.2   0:16.18 php-fpm
27671 apache    20   0  531m  69m  21m R 43.9  0.2   0:09.85 php-fpm
27672 apache    20   0  501m  41m  23m R 32.9  0.1   0:09.18 php-fpm
27702 apache    20   0  508m  40m  16m R 23.6  0.1   0:00.94 php-fpm

Well, kinda well. Lots of CPU used but theres only a few of them, so its kinda ok.

Then, seemingly out of nowhere, I spawn a bunch of processes (last time we had 52) and each one is using 8% CPU. You don't need to be good at this to know that 52 * 8 is A LOT.

Now, I have max_children set to 40 now (it was 50.)

pm.max_children = 40
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6
pm.max_requests = 100

Memory limit in php.ini is 128mb.

So, I understand why I get so many processes - thats fine, I have them configured. What I'm curious about is, if we have 8% cpu per process - is that too much? And, maybe my processes are staying alive too long?

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     26575  0.0  0.0 499572  4944 ?        Ss   18:23   0:01 php-fpm: master process (/etc/php-fpm.conf)
apache   28161 16.1  0.1 516644 47588 ?        S    19:06   0:08 php-fpm: pool www
apache   28164 18.0  0.1 525044 59644 ?        S    19:06   0:07 php-fpm: pool www
apache   28166 18.6  0.1 513152 41388 ?        R    19:06   0:06 php-fpm: pool www
apache   28167 23.2  0.1 515520 47092 ?        S    19:06   0:07 php-fpm: pool www
apache   28168 15.2  0.1 515804 49320 ?        S    19:06   0:04 php-fpm: pool www
apache   28171 17.3  0.1 514484 43752 ?        S    19:06   0:04 php-fpm: pool www

As I write this, its 7:08PM - so the children processes have been running for 2 minutes and have probably served a lot in that time (Theres 700 people on the forums atm.)

So - super keen to hear advice/criticism/opinions. I've had so much downtime lately I'm verging on setting up Apache again and I'd love to stick with this.

Thanks in advance.

EDIT
This is the Bitnami graph showing the spikes and how quickly they occur (this is 24h) CPU usage graph

EDIT #2
nginx.conf can be found here.

EDIT #3
I bumped my numbers up. Its looking good but still makes me a little nervous.

pm.max_children = 100
pm.start_servers = 25
pm.min_spare_servers = 25
pm.max_spare_servers = 50
pm.max_requests = 500

EDIT #4
So, I've had a few more downtimes and I've setup Splunk and New Relic to help me monitor what is going on. It seems that there is no CPU waiting time and I still have free memory.

top - 17:30:37 up 10 days, 19:20,  2 users,  load average: 24.61, 37.34, 25.68
Tasks: 151 total,  20 running, 131 sleeping,   0 stopped,   0 zombie
Cpu0  : 71.8%us, 27.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  : 73.7%us, 26.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 70.8%us, 29.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 69.3%us, 30.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  35062648k total, 28747980k used,  6314668k free,   438032k buffers
Swap:        0k total,        0k used,        0k free, 16527768k cached

Watching Splunk it doesn't seem like its traffic - I do get hammered by Googlebot a lot and I was suspicious of it, but nothing concrete.

Do you use some kind of php acceletator apc for example. Show your nginx.conf. And why are you setting up pm.nax_requests ?? and why pm.max_spare_servers is so low ?? I think pm.max_spare_servers should be at least set up to 30. — B14D3, Mar 13 '13 at 08:31
I have added a gist of my nginx.conf. What is the benefit of raising those numbers? — Christian, Mar 13 '13 at 08:44

score 1 · Accepted Answer · answered Mar 13 '13 at 10:52

Well, kinda well. Lots of CPU used but theres only a few of them, so its kinda ok

The amount of work a process does is given by the amount of CPU it uses - hence it's the % usage x the time it's being used for. So if your PHP processes are using up lots of CPU then that's usually a good thing - it means it's not waiting for other stuff to happen. From a systems admin point of view you can make more CPU available for other tasks - but you'll increase the amount of time taken for a PHP process to handle a request.

And of course if requests are coming in at steady rate, taking longer to process them means that there will be more process working (or waiting to be scheduled) at any one time, and the larger memory footprint means less memory available for VFS caching. There's also the risk that the scheduler will start pre-empting tasks rather than letting them yeild which decreases overall throughput.

Hence you should be aiming for high CPU usage! (but low load).

Memory limit in php.ini is 128mb

This is rather high, but since the CPU usage is high that indicates it's not too much of a problem UNLESS each request is taking a long time to complete and doing a lot of garbage collection - in this scenario, forcing more frequent garbage collection by reducing the memory limit can actually improve performance.

I've had so much downtime lately

For what it's worth I've found APC to be a lot more reliable than Xcache - and that seems to be the trend in what I've read elsewhere. Don't have any experience / data on the Zend Optimizer+ which is due to be bundled with future versions of PHP.

I'm verging on setting up Apache again and I'd love to stick with this

Pre-fork apache + mod_php is significantly faster than nginx + php-fpm at lower traffic volumes - but nginx seems to have a massive advantage when the load / request rate is medium - high relative to the hardware capacity (which is where you seem to be).

If it were me I'd be looking hard at the logs to see if the spikes are due to an increase in the traffic volume / a change in the traffic profile / a change in the response time. Also looking hard at caching and code. You might consider temporarily running a profiler against the live site (use xhprof - not xdebug for a production system).

Wow, what an answer! That first part about the amount of work a process is doing clears up a lot. Fewer processes will mean higher individual CPU, so reducing them is not necessarily a good idea. Will watch logs, so far the spikes have been unrelated to traffic. Since increasing my max numbers yesterday (edit #3) I have seen a significant improvement and that just goes with what you were saying. Thanks! — Christian, Mar 13 '13 at 22:52
Symcbean it looks like I might be getting hammered by bots (Google) at high traffic times. So, just snowballs. I get lots and lots of processes using roughly 9% cpu. — Christian, Mar 15 '13 at 23:21

NGINX/PHP-FPM cpu spikes and runaway processes

1 Answers1