I have an m1.medium Amazon EC2 instance running Apache and hosting a Wordpress blog. Wordpress in turn uses a MySQL database on a separate EC2 instance. The Wordpress site has W3 Total Cache set up and working well, and much of the static content on the site is served from a CDN. The site normally gets little traffic but occasionally sees huge spikes, and when those spikes occur (more than ~150 people accessing the site at once), the site goes down. I can also reproduce this every time with load testing tools.
Here is the 'top' output from the main server when it is idle:
top - 23:21:23 up 103 days, 19:40, 3 users, load average: 0.91, 0.60, 0.62
Tasks: 93 total, 1 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.9%sy, 0.0%ni, 99.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3844856k total, 1756268k used, 2088588k free, 150132k buffers
Swap: 0k total, 0k used, 0k free, 833740k cached
However, if I do some load tests to simulate hundreds of users accessing a static graphic file (which obviously doesn't trigger Wordpress, PHP or the database), everything is fine: the server load stays low, the graphic file is served quickly, etc.
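For anyone wanting to reproduce the static-versus-dynamic comparison, here is a minimal Python sketch of that kind of load test. It is not the tool used for the tests described here. The names `fetch_url` and `run_load` and the default request counts are made up for illustration, and the script only approximates concurrent users with threads.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL, read the whole body, and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def run_load(fetch, url: str, total: int = 200, concurrency: int = 50) -> dict:
    """Call fetch(url) `total` times from `concurrency` threads and summarise."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            fetch(url)
            ok = True
        except Exception:
            # Timeouts, refused connections and HTTP 4xx/5xx all count as errors.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total,
        "errors": sum(1 for ok, _ in results if not ok),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }
```

Running `run_load(fetch_url, ...)` once against a static image URL and once against a page rendered by PHP, then comparing the error counts and latencies, should show whether only the dynamic path degrades under load.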
My Apache settings (3.1 GB of memory on the server / ~8,100 KB per httpd process ≈ 400 MaxClients):
StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 400
MaxClients 400
MaxRequestsPerChild 0
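The MaxClients arithmetic can be written out as a short Python sketch. The figures are the ones quoted in this post, and the function name `estimate_max_clients` is made up for illustration. The result is only as reliable as the per-process memory figure that goes into it.

```python
def estimate_max_clients(available_kb: int, per_process_kb: int) -> int:
    """Return how many whole httpd processes fit in the given memory."""
    if per_process_kb <= 0:
        raise ValueError("per_process_kb must be positive")
    return available_kb // per_process_kb


# Figures from this post: ~3.1 GB usable, ~8100 KB per httpd process.
available_kb = int(3.1 * 1024 * 1024)
print(estimate_max_clients(available_kb, 8100))
```

With these inputs the result comes out at roughly 400, which is where the ServerLimit and MaxClients values come from. Re-running it with a per-process figure measured while the site is under load would show how sensitive the limit is to that number.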
So based on all that, the problem seems to occur only when PHP or MySQL is involved.
Over on the MySQL server, no matter what I do, the load stays pretty much at 0, and the slow query log stays empty... so I think things are healthy there. Here is my 'top' for the SQL server:
top - 23:20:21 up 103 days, 19:12, 5 users, load average: 0.08, 0.03, 0.05
Tasks: 115 total, 1 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3844856k total, 1076912k used, 2767944k free, 158412k buffers
Swap: 0k total, 0k used, 0k free, 638092k cached
This all leads me to think that one of these unlikely scenarios is occurring:
- The machine simply doesn't have enough raw horsepower to serve 150+ concurrent users, and I should move up to an L or XL EC2 instance. Maybe, but... really? 150 users is too many for a relatively powerful m1.medium server?
- Apache simply isn't built to handle this kind of traffic. Doubtful.
- There is a problem in the communication between the web and database servers. But I doubt that since this is between two Amazon EC2 instances.
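One rough way to test the last scenario is to time TCP connections from the web server to the database server. The sketch below is a hedged illustration: `time_tcp_connect` is a made-up helper, port 3306 is assumed because it is the MySQL default, and the timing covers only the TCP handshake, not MySQL authentication or query execution.

```python
import socket
import time


def time_tcp_connect(host: str, port: int = 3306, attempts: int = 5,
                     timeout: float = 3.0) -> list:
    """Return the time in seconds taken by each TCP connect to host:port.

    A failed or timed-out connection raises OSError to the caller.
    """
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            timings.append(time.perf_counter() - start)
    return timings
```

Calling it from the web server with the database server's address, both while idle and during a load test, would show whether connection setup slows down or fails under load.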
I feel like I've checked everything I can and still no luck. What else should I check? What else can I try?