I have a WordPress multi-user site that pegs all of my CPUs at more than 90% usage:
top - 12:02:58 up 55 days, 5:25, 10 users, load average: 20.51, 15.66, 14.90
Tasks: 294 total, 24 running, 270 sleeping, 0 stopped, 0 zombie
Cpu0 : 87.5%us, 8.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 4.5%si, 0.0%st
Cpu1 : 97.9%us, 1.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu2 : 96.0%us, 3.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Cpu3 : 97.6%us, 2.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu4 : 97.1%us, 2.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu5 : 97.9%us, 1.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu6 : 97.9%us, 1.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Cpu7 : 96.0%us, 3.5%sy, 0.0%ni, 0.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 14369424k total, 11903548k used, 2465876k free, 402360k buffers
Swap: 4063200k total, 3594784k used, 468416k free, 1484116k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30658 apache 16 0 274m 97m 6304 R 62.1 0.7 0:12.49 php-cgi
30686 apache 16 0 213m 92m 6040 R 52.2 0.7 0:03.27 php-cgi
30685 apache 15 0 211m 87m 5764 S 50.3 0.6 0:04.50 php-cgi
28217 apache 16 0 529m 405m 6748 S 49.0 2.9 3:54.72 php-cgi
30468 apache 16 0 414m 291m 6452 R 48.5 2.1 0:49.78 php-cgi
29604 apache 15 0 258m 135m 6464 S 47.4 1.0 2:16.22 php-cgi
28308 apache 16 0 584m 408m 6724 R 43.9 2.9 3:43.07 php-cgi
28266 apache 16 0 550m 374m 6728 R 43.7 2.7 3:58.38 php-cgi
29573 apache 16 0 584m 407m 6592 R 36.8 2.9 1:59.88 php-cgi
30470 apache 16 0 219m 95m 6452 S 36.5 0.7 0:39.66 php-cgi
29138 apache 15 0 513m 334m 6528 S 33.6 2.4 2:03.14 php-cgi
30472 apache 17 0 441m 318m 6272 R 31.7 2.3 0:50.45 php-cgi
28283 apache 16 0 414m 291m 6580 R 29.3 2.1 3:53.06 php-cgi
29858 apache 16 0 251m 127m 6628 R 24.8 0.9 1:15.53 php-cgi
28253 apache 18 0 550m 374m 6580 R 24.5 2.7 4:08.05 php-cgi
30666 apache 15 0 217m 94m 5996 R 24.5 0.7 0:04.68 php-cgi
28208 apache 20 0 584m 407m 6436 R 24.2 2.9 4:36.36 php-cgi
29085 apache 25 0 358m 182m 6488 R 22.6 1.3 2:19.76 php-cgi
28258 apache 25 0 530m 407m 6512 R 22.4 2.9 3:58.70 php-cgi
29574 apache 16 0 530m 406m 6540 S 21.6 2.9 2:19.26 php-cgi
28947 apache 16 0 524m 401m 6476 R 14.1 2.9 2:32.33 php-cgi
28238 apache 15 0 488m 312m 6852 S 12.3 2.2 4:24.34 php-cgi
30464 apache 15 0 274m 151m 6176 R 11.2 1.1 0:19.67 php-cgi
28293 apache 16 0 269m 146m 6460 R 9.9 1.0 3:57.17 php-cgi
28205 apache 25 0 530m 407m 6496 R 9.6 2.9 4:05.49 php-cgi
30471 apache 19 0 263m 140m 6440 R 6.9 1.0 0:47.42 php-cgi
The output shows that the most CPU any individual process uses is ~60%, but there have been times when as many as 7 processes were each using more than 90% CPU.
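To put numbers on a snapshot like the one above, here is a minimal sketch that totals %CPU and resident memory per command. It assumes the column layout shown in the header row (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) and that bare size values are in KiB. The function names and the two-row sample are only illustrative.

```python
from collections import defaultdict


def parse_size_mb(field):
    """Convert a top size field such as '97m', '1.2g' or '6304' (KiB) to MiB."""
    units = {"k": 1 / 1024, "m": 1.0, "g": 1024.0}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field) / 1024  # assumption: bare numbers are KiB


def summarize_top(text):
    """Total process count, %CPU and RES per command from a top snapshot."""
    totals = defaultdict(lambda: {"count": 0, "cpu": 0.0, "res_mb": 0.0})
    for line in text.splitlines():
        fields = line.split()
        # Process rows have at least 12 columns and start with a numeric PID;
        # the summary lines (top -, Tasks:, Cpu0, Mem:, Swap:) are skipped.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        res, cpu, command = fields[5], fields[8], fields[11]
        entry = totals[command]
        entry["count"] += 1
        entry["cpu"] += float(cpu)
        entry["res_mb"] += parse_size_mb(res)
    return dict(totals)


# Two rows copied from the snapshot above, plus the header row.
SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30658 apache 16 0 274m 97m 6304 R 62.1 0.7 0:12.49 php-cgi
30686 apache 16 0 213m 92m 6040 R 52.2 0.7 0:03.27 php-cgi
"""

if __name__ == "__main__":
    for command, entry in summarize_top(SAMPLE).items():
        print(command, entry["count"], round(entry["cpu"], 1), round(entry["res_mb"], 1))
```

Feeding it the full output of a batch run (for example `top -b -n 1` saved to a file) would show how much CPU and memory the php-cgi processes hold in total, which is easier to compare with the 8 cores and the swap usage than the per-process rows.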
The site runs as follows:
nginx works as a reverse proxy, serving every static file that it can and caching pages via the proxy_cache directive.
It delegates to Apache when PHP scripts are required. These are run via mod_cgi using the ExecCGI option.
Both Apache and nginx compress every human-readable file.
To avoid hitting MySQL on every request, we store HTML fragments in memcached, which currently holds between 2 and 4 MB, as reported by the stats command over a telnet connection.
There are also some counters kept in a Redis database, mostly to count page views for every post.
No WP Super Cache (nginx does the caching), no XCache.
I'm at a loss as to how to determine what exactly each php-cgi process is doing that requires so much CPU. The site was heavily modified by several different software teams before we took over its maintenance.
The PHP errors log shows mostly these errors:
- "Cannot redeclare class FacebookRestClientException"
- "Call to undefined function e_()"
- Invalid SQL syntax, mostly here: "WHERE post_id = xxxxx AND blog_id = "
- "Allowed memory size of 268,435,456 bytes exhausted"
- "Call to undefined method Services_JSON::encodeUnsafe()"
None of these actually perform any computation, so they can't be the source of the CPU problem.
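To check how often each of these errors actually occurs, a rough tally of the log can help. The sketch below assumes the common PHP error log line shape (a bracketed timestamp, the message, then "in FILE on line N"). The sample lines, paths and function names are hypothetical and not taken from the real log.

```python
import re
from collections import Counter

TIMESTAMP = re.compile(r"^\[[^\]]*\]\s*")                       # leading [date time]
LOCATION = re.compile(r"\s+in\s+\S+\s+on\s+line\s+\d+\s*$")     # trailing file and line
NUMBER = re.compile(r"\d+")
SPACES = re.compile(r"\s+")


def normalize(line):
    """Reduce a log line to its message so identical errors group together."""
    line = TIMESTAMP.sub("", line.strip())
    line = LOCATION.sub("", line)
    line = NUMBER.sub("N", line)        # ignore varying byte counts and ids
    return SPACES.sub(" ", line).strip()


def tally_errors(lines):
    """Count occurrences of each normalized error message."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[normalize(line)] += 1
    return counts


# Hypothetical log lines, shaped like typical PHP error log entries.
SAMPLE_LOG = """\
[01-Jan-2012 12:00:01] PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 71 bytes) in /srv/example/wp-includes/a.php on line 120
[01-Jan-2012 12:00:02] PHP Fatal error:  Call to undefined function e_() in /srv/example/wp-content/themes/example/functions.php on line 10
[01-Jan-2012 12:00:03] PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 4096 bytes) in /srv/example/wp-includes/b.php on line 77
"""

if __name__ == "__main__":
    for message, count in tally_errors(SAMPLE_LOG.splitlines()).most_common():
        print(count, message)
```

If the memory-limit error turned out to be frequent, it would at least be worth comparing its timestamps with the periods of highest load, since a request that allocates 256 MB before dying has probably done a fair amount of work first.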
I tried tracing system calls and saw lstat, read, write and access, which, if they were the problem, would show up as I/O wait rather than CPU load (correct?). There were also calls to both poll and select.
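One way to test that assumption is to record the trace with per-call timings (strace's `-T` option appends the elapsed time of each call in angle brackets) and add them up. The sketch below assumes that line format. The sample lines and function name are illustrative, and the times are elapsed time inside each call, not CPU time.

```python
import re
from collections import defaultdict

LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"   # optional PID prefix when following children
    r"(?P<name>\w+)\(.*\)\s+=\s+.*"    # syscall name, arguments, return value
    r"<(?P<secs>\d+\.\d+)>\s*$"        # elapsed time appended by -T
)


def summarize_strace(lines):
    """Return {syscall: (call count, total elapsed seconds)} for completed calls."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip signals, unfinished/resumed fragments, etc.
        entry = stats[match.group("name")]
        entry[0] += 1
        entry[1] += float(match.group("secs"))
    return {name: (count, secs) for name, (count, secs) in stats.items()}


# Illustrative lines in the shape produced by `strace -T`; paths are made up.
SAMPLE_TRACE = '''\
lstat("/srv/example/index.php", {st_mode=S_IFREG|0644, st_size=397, ...}) = 0 <0.000011>
read(5, "<?php"..., 8192) = 397 <0.000020>
read(5, "", 8192) = 0 <0.000009>
poll([{fd=4, events=POLLIN}], 1, 1000) = 1 ([{fd=4, revents=POLLIN}]) <0.004000>
access("/srv/example/wp-config.php", F_OK) = 0 <0.000010>
'''

if __name__ == "__main__":
    summary = summarize_strace(SAMPLE_TRACE.splitlines())
    for name, (count, secs) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        print(f"{name:10s} calls={count:5d} elapsed={secs:.6f}s")
```

If the summed syscall time is small compared with the wall-clock length of the trace, the process is spending its time in user space (consistent with the high %us and low %sy above), and the next step would be profiling the PHP code itself rather than looking at I/O.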
Could someone give me pointers as to what to check next?