
We are having trouble with our Rackspace WordPress server falling over under medium traffic after an email send.

The server specs are:

CPU            2 vCPUs
RAM            2 GB
System Disk   80 GB
Network      240 Mb / s
Disk I/O    Good

Running:

CentOS       7.0
WordPress  4.3.1
httpd      2.4.6
PHP       5.4.11
MariaDB   5.5.41

The installation is all fairly standard as far as I can tell, and the database is pretty standard too: indexed and fairly small. We are also using WordPress object caching.

According to New Relic, during normal traffic the site spends about 80% of its time in PHP, 15% in web external, and only a small percentage in the database. Average app time for a standard page is around 800ms, which does seem slow to me.

Running a load test of 250 connections over 1 minute causes requests to take progressively longer and then start timing out after about 30, and the server becomes unresponsive (even when traffic dies back down). It requires a hard reboot to become active again.

I can't connect using PuTTY, and the home page oscillates between timing out and returning the dreaded 'Error Establishing Database Connection'.

Using the rackspace monitoring agent on the most recent test it appears that the CPU is maxing at 100% just before death, the memory used is peaking at about 1.6GB with free dropping to about 100MB. It looks like about 2GB of Swap Memory (total 4GB) is being used too. Standard usage appears to be about 15% CPU, 800MB memory and 400MB swap.

Our Apache config doesn't set any of the following (no files in /etc do): Timeout, KeepAlive, MaxKeepAliveRequests, KeepAliveTimeout; so I'm guessing it is using the default values.
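For reference, these are httpd 2.4's compiled-in defaults for those directives (per the Apache docs; setting them explicitly in a conf.d file makes any tuning visible):

```apacheconf
# httpd 2.4 built-in defaults — stating them explicitly is a no-op,
# but it gives you a single place to tune from:
Timeout 60
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5
```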

I've looked at mariadb settings:

innodb_buffer_pool_size = 1400M
max_user_connections = 0

Neither of which seems to be the cause.
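For context, these live in `/etc/my.cnf` (exact file location assumed). One thing I notice writing them out: 1400M is roughly 70% of the instance's 2 GB RAM, which leaves little headroom for httpd/PHP workers and the OS.

```ini
# /etc/my.cnf fragment (location assumed); values as above.
# Note: a 1400M buffer pool on a 2 GB instance leaves < 600 MB
# for everything else, which would fit the swapping seen under load.
[mysqld]
innodb_buffer_pool_size = 1400M
max_user_connections    = 0
```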

I've also turned on the performance_schema, but I don't really know what I'm looking for. I'm not even sure the DB is the problem.
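For example, one place to start is the global wait summary (a table present in the 5.5-era performance_schema; statement digests only arrived in later versions). Timer columns are in picoseconds, hence the division:

```sql
-- Top wait events since server start, in seconds:
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT/1e12 AS wait_s
FROM performance_schema.events_waits_summary_global_by_event_name
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

If nothing here dominates, that supports the suspicion that the DB isn't the problem.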

I'm tempted to upgrade the instance, but I'd rather have a clearer view of where the bottleneck is and what is causing the server to die rather than just slow down.

Any ideas on where to start? There seem to be lots of possible tweaks out there and a lot of information.

Arth
  • What's your PHP version? How does your `httpd-default.conf` look? What's your peak memory usage for the average WordPress page? How fast does the page load under low traffic? – MonkeyZeus Oct 08 '15 at 14:55
  • @MonkeyZeus Thanks, I have included the PHP version above and the average load time. How do I get the peak memory usage for an average page? And I only have an `httpd.conf` file which is quite large. Which bits should I share? – Arth Oct 08 '15 at 15:15
  • I am not sure if WordPress funnels and routes everything through one `index.php` but basically you would want to use [**`memory_get_peak_usage()`**](http://php.net/manual/en/function.memory-get-peak-usage.php) at the very end of your script execution and log it to the DB. 800 ms is actually quite slow; how much of that is server execution time vs. network latency? Don't worry about the conf file for now. If you are able to then I highly recommend upgrading to PHP 5.6.14. I went from 5.3.5 to 5.6.3 and saw a noticeable speed boost. – MonkeyZeus Oct 08 '15 at 15:42
  • If you really want to dive into the Apache configuration then you should look into these: Timeout, KeepAlive, MaxKeepAliveRequests, KeepAliveTimeout. I would bet that your Apache instance is running out of memory and is resorting to HDD swapping so your performance tanks; not sure why it doesn't recover though. Once your server takes a nose-dive, does it go back to normal after an hour? Have you given it this much time? – MonkeyZeus Oct 08 '15 at 15:48
  • @MonkeyZeus I believe that's all on the server.. it matches up with TTFB (waiting). Cool, will look into the memory, although my total memory stats for the server don't seem to be doing anything mental before the server died (I guess the Apache memory use could be hidden? I seem to remember it allocates a chunk of memory and then slowly uses it up). – Arth Oct 08 '15 at 15:52
  • @MonkeyZeus I will have a look at those settings and post them up. I have also just noticed a lot of these errors in the log `PHP Fatal error: Uncaught exception 'RedisException' with message 'read error on connection'` repeatedly following the nose-dive and up until reboot – Arth Oct 08 '15 at 15:56
  • @MonkeyZeus It's a production server, so I can't really afford to leave it down for an hour, I'm pretty sure the first time it happened, it stayed down for over half an hour before I rebooted it. – Arth Oct 08 '15 at 15:57
  • Yep, TTFB is definitely what I was inquiring about. What is your free RAM space during low or no activity? I don't remember my settings but MySQL on my dev box, Win7, with zero traffic takes 1GB – MonkeyZeus Oct 08 '15 at 15:57
  • Sounds like a CPU bottleneck. Run `htop` or `top` during a load test to confirm it. – Michael Hampton Oct 08 '15 at 15:58
  • I don't expect you to leave your production server down for an hour, I was just curious. It's kind of baffling as to why it didn't recover after half an hour. – MonkeyZeus Oct 08 '15 at 15:58
  • Any fatal Apache errors being logged? – MonkeyZeus Oct 08 '15 at 16:00
  • @MonkeyZeus Nope, not that I can find – Arth Oct 08 '15 at 16:03
  • I see, for the time being see if you can get any of my other questions answered. I think it will help to track down the source of the issue. Definitely track down that Redis error and fix it or remove the use of Redis because, in general, it takes longer for a connection to fail than it does to succeed. – MonkeyZeus Oct 08 '15 at 16:12
  • @MonkeyZeus I can't find mention of those Apache config options in `/etc` so I'm guessing they are all set to the defaults. – Arth Oct 08 '15 at 16:13
  • @MonkeyZeus Cool, will look into them all and update the question where appropriate – Arth Oct 08 '15 at 16:14
  • Any luck with this? – MonkeyZeus Oct 09 '15 at 15:13
  • @MonkeyZeus yep, sorry for the slow response.. turns out it was the CPU and memory. I put the opcache on and upgraded the memory. It seems to be able to handle about 10 times as much traffic now without falling over. Thanks for all your help! – Arth Oct 12 '15 at 11:29

2 Answers


Close monitoring during any sort of event is crucial. As we see, the truth came out:

> Using the rackspace monitoring agent on the most recent test it appears that the CPU is maxing at 100% just before death, the memory used is peaking at about 1.6GB with free dropping to about 100MB. It looks like about 2GB of Swap Memory (total 4GB) is being used too. Standard usage appears to be about 15% CPU, 800MB memory and 400MB swap.

PHP is well known to be rather CPU intensive. You've used all of the available CPU and nearly all of the available RAM.

You should first take steps to deal with that, such as opcode caching (e.g. Zend OPcache) and file caching (e.g. W3 Total Cache WordPress plugin). If those don't help enough, then it's time to upgrade the instance.
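A minimal sketch of enabling the opcode cache (assumption: the OPcache extension is installed — it ships with PHP from 5.5 onward, and is available via PECL for the 5.4 in use here; directive names are the stock ones):

```ini
; /etc/php.d/opcache.ini sketch (path assumed)
zend_extension=opcache.so
opcache.enable=1
opcache.memory_consumption=64      ; MB of shared opcode cache
opcache.max_accelerated_files=4000 ; enough for a typical WordPress tree
```

On a CPU-bound PHP workload this alone often cuts per-request CPU time substantially, since scripts are no longer recompiled on every hit.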

Michael Hampton

You are probably just running too many processes at once, running out of memory, and churning swap. It may be that something else is locking up, but deal with this first and then see where you are at.

You haven't told us whether you are using mod_php or something like php-fpm. The latter will handle load better, but in either case make sure you don't run more PHP processes than you have memory for. You probably won't get any performance benefit from running more than 5 or 10 processes, but the default mod_php configuration in particular will run far more than you have memory for. Also, recycle processes every 30 or so requests. If you give 1GB to your database and OS, then your other GB probably won't handle 10 WordPress processes. Look at how much memory they take and work it out, with a little headroom. You shouldn't be using any swap in the normal course of things.
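A sketch of that arithmetic, with assumed numbers (measure your own per-worker RSS first — the process name `httpd` is the CentOS default, use `apache2` on Debian):

```shell
#!/bin/sh
# Step 1: average resident size of the running httpd/mod_php workers.
ps -C httpd -o rss= |
  awk '{sum+=$1; n++} END {if (n) printf "%d workers, avg %.0f MB each\n", n, sum/n/1024}'

# Step 2: cap the worker count so it fits in what's left after the
# OS and MariaDB take their share (all values below are assumptions):
TOTAL_MB=2048       # instance RAM
RESERVED_MB=1024    # assumed for MariaDB + OS
PER_PROC_MB=90      # assumed per-worker RSS from step 1
echo "MaxRequestWorkers $(( (TOTAL_MB - RESERVED_MB) / PER_PROC_MB ))"
# → MaxRequestWorkers 11
```

With prefork/mod_php that number goes in `MaxRequestWorkers` (Apache 2.4's name for the old `MaxClients`); the recycling suggestion above maps to `MaxConnectionsPerChild 30`.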

Look at your keep-alive settings. With Apache you are probably best either to turn it off or to set it to 1 second. Nginx handles keep-alive much better; in fact, this is the main reason nginx is likely to perform better with a PHP application like WordPress (though it comes at the cost of less pleasant configuration). This very likely isn't a factor in your testing, but it matters with real browsers.
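Concretely, that suggestion is just (directive names are stock Apache 2.4; the 1-second value is the recommendation above, not something measured on this box):

```apacheconf
# Release workers almost immediately instead of holding them
# open for the default 5 seconds per idle client:
KeepAlive On
KeepAliveTimeout 1
# ...or disable it outright:
# KeepAlive Off
```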

100% CPU surprises me. Use top to see what is using it. Remember also that 100% often means 100% of one core. You may just be seeing a cron job kicking in; with WordPress that is typically not system cron as such, but WP-Cron jobs run piggybacked on ordinary web requests. Lack of opcode caching could also be causing high CPU usage.

mc0e