My server crashes and I've checked everything: what's left?

Question

I have an m1.medium Amazon EC2 instance running Apache and hosting a Wordpress blog. Wordpress, in turn, talks to a MySQL database on a separate EC2 instance. The Wordpress site has W3 Total Cache set up and working well, and much of the static content on the site is served from a CDN. The site normally gets little traffic but occasionally sees huge spikes, and when those spikes occur (more than ~150 people accessing the site), the site goes down. I can also reproduce this every time with load testing tools.

Here is the 'top' when the main server is idle:

top - 23:21:23 up 103 days, 19:40,  3 users,  load average: 0.91, 0.60, 0.62
Tasks:  93 total,   1 running,  92 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.9%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3844856k total,  1756268k used,  2088588k free,   150132k buffers
Swap:        0k total,        0k used,        0k free,   833740k cached

However, if I do some load tests to simulate hundreds of users accessing a static graphic file (which obviously doesn't trigger Wordpress, PHP or the database), everything is fine: the server load stays low, the graphic file is served quickly, etc.

My Apache settings (3.1G of memory in server / ~8100k per httpd instance = ~400 MaxClients):

    StartServers         5
    MinSpareServers      5
    MaxSpareServers     10
    ServerLimit        400
    MaxClients         400
    MaxRequestsPerChild  0
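For reference, the MaxClients arithmetic above can be written out as a small Python sketch. The function name `estimate_max_clients` is made up for illustration; the 3.1G and ~8100k figures are the ones quoted above. The result is only as good as the per-process figure, which may be different when httpd is running PHP rather than serving static files.

```python
def estimate_max_clients(available_kb: int, per_process_kb: int) -> int:
    """Return how many worker processes fit in the given memory budget.

    Hypothetical helper: it only divides the memory budget by the
    per-process size and rounds down.
    """
    if per_process_kb <= 0:
        raise ValueError("per_process_kb must be positive")
    return available_kb // per_process_kb


# Figures from the question: 3.1G of memory, ~8100k per httpd process.
available_kb = int(3.1 * 1024 * 1024)
print(estimate_max_clients(available_kb, 8100))
```

Re-running this with a per-process size measured during a load test would show how far the safe MaxClients value moves.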

So based on all that, it seems like the problem has to do with when PHP or MySQL are used.

Over on the MySQL server, no matter what I do, the load stays pretty much at 0, and the slow query log stays empty... so I think things are healthy there. Here is my 'top' for the SQL server:

top - 23:20:21 up 103 days, 19:12,  5 users,  load average: 0.08, 0.03, 0.05
Tasks: 115 total,   1 running, 114 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3844856k total,  1076912k used,  2767944k free,   158412k buffers
Swap:        0k total,        0k used,        0k free,   638092k cached

This all leads me to think that one of these unlikely scenarios is occurring:

The machine simply doesn't have enough raw horsepower to serve 150+ concurrent users, and I should move up to an L or XL EC2 instance. Maybe, but... really? 150 users is too many for a relatively powerful m1.medium server?
Apache simply isn't built to handle this kind of traffic. Doubtful.
There is a problem in the communication between the web and database servers. But I doubt that since this is between two Amazon EC2 instances.

I feel like I've checked everything I can and still no luck. What else should I check? What else can I try?

score 1 · Answer 1 · answered Feb 14 '13 at 23:55


An m1.medium's 'horsepower' equates to one CPU in this case, and your top output shows a load of 0.91. A load of 0.91 means that, averaged over the last minute, processes were asking for about 91% of one CPU's capacity. In short, something seems to be starving your CPU while the server is supposedly idle.

Assuming that's the issue, I'd pare down whatever service is eating your CPU. If that's not an option on your main server, I'd make an AMI of your existing machine, then spin up another two hosts of instance type t1.micro and ensure only the bare minimum services are running on them (Apache in this case). Then use round-robin DNS for your website address. This will effectively triple your burst CPU capacity, while giving 2x CPU baseline.

Stephan


Yeah, given that when 'idle' your load is 0.91 and your cpu is 99.1% idle, sounds like you may have I/O issues. You can use `iostat -xm 1` to get a sense of disk usage. You should also check network utilisation - I like `bmon` for this. Remember that all I/O is CPU bound on ec2. – chrskly Feb 15 '13 at 00:04
A load of .91 doesn't mean 99.1% idle, it means 91% of one cpu is being requested. If you had four cpus (or one cpu with four cores) a load of .91 would mean slightly less than a quarter of total cpu resources are being utilized. Further reading: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages – Stephan Feb 15 '13 at 00:07
Wasn't disagreeing ;) I was saying that given that cpu usage is low (99.1% idle) and that load is high (0.91) it indicates that there might be high I/O. Since load isn't just about CPU usage. – chrskly Feb 15 '13 at 00:13
Doh, my bad; had my head in another place for a minute :) Still, load is an 'average' of cpu demand over a period of time, typically 1/5/15 minutes. There might be processes burning hot for 50 seconds, then go idle for the time he copies the output. – Stephan Feb 15 '13 at 00:16
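To make the load-versus-CPU discussion in these comments concrete, here is a minimal Python sketch that divides the load average by the number of CPUs. The helper name `load_per_cpu` is hypothetical; `os.getloadavg()` and `os.cpu_count()` are standard library calls available on Unix-like systems. Keep in mind that on Linux the load average also counts tasks in uninterruptible (typically I/O) wait, so a high load with an idle CPU is possible, as noted above.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average expressed per CPU (hypothetical helper)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def main() -> None:
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    for label, value in (("1m", one_min), ("5m", five_min), ("15m", fifteen_min)):
        print(f"{label}: load {value:.2f} = {load_per_cpu(value, cpus):.2f} per CPU")


if __name__ == "__main__":
    main()
```

With the figures from the question, a load of 0.91 on one CPU is 0.91 per CPU, while the same load on four CPUs would be roughly 0.23 per CPU.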

score 1 · Answer 2 · answered Feb 15 '13 at 00:03

We serve a larger number of users on a smaller EC2 instance (and have used ApacheBench with up to 1000 concurrent sessions).

The good news is that you can reproduce it.

There are a lot of things you can check:

Put a tool like New Relic on both servers (use the free version to log CPU/memory history for starters). This is good to have anyway for trending.
Start with the communication between the two servers. Do some file transfers and check the speed.
We had an issue with wp-cron.php killing the Wordpress server from time to time. Try disabling it and moving it to a cron call every few minutes.
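As a rough first check of the link between the two servers, something like the following Python sketch could be run on the web server to time TCP connections to the database host. It measures connection setup time only, not transfer speed, so it complements a file transfer test rather than replacing it. The function name `connect_times_ms` and the host name in the usage comment are placeholders, not anything from the question.

```python
import socket
import time
from typing import List


def connect_times_ms(host: str, port: int, attempts: int = 5,
                     timeout: float = 3.0) -> List[float]:
    """Open and close a TCP connection several times; return each time in ms."""
    times = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection raises OSError if the host is unreachable
        # or the connection does not complete within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        times.append((time.perf_counter() - start) * 1000.0)
    return times


# Example usage (placeholder host; 3306 is the default MySQL port):
# print(connect_times_ms("db.example.internal", 3306))
```

Running it while idle and again during a load test would show whether connection times to the database climb under load.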

Report back what you find and I can suggest a few more ideas.
