
I'm trying to manage a server on Amazon for a network of sites that receives about 100 million pageviews per month. Unfortunately, nobody on my team of 5 developers has much server admin experience.

Right now we have MaxClients set to 1400. Our traffic is currently about average, and we have 1150 total Apache processes running, each using about 2% CPU! Of those 1150, about 800 are sleeping but still taking up CPU. I'm sure there are ways to optimize this. I have a few thoughts:

  1. It appears Apache is creating a new process for every single connection. Is this normal?
  2. Is there a way to more quickly kill the sleeping processes?
  3. Should we turn KeepAlive on? Each page loads about 15-20 medium-sized graphics and a lot of javascript/css.

So, here's our Apache setup. We do plan on contracting a server admin as soon as possible, but I would really appreciate some advice until we can find someone.

Timeout 25
KeepAlive Off
MaxKeepAliveRequests 200
KeepAliveTimeout 5

<IfModule prefork.c>
StartServers         100
MinSpareServers      20
MaxSpareServers      50
ServerLimit          1400
MaxClients           1400
MaxRequestsPerChild  5000
</IfModule>

<IfModule worker.c>
StartServers         4
MaxClients           400
MinSpareThreads      25
MaxSpareThreads      75
ThreadsPerChild      25
MaxRequestsPerChild  0
</IfModule>

Full top output:

top - 23:44:36 up 1 day,  6:43,  4 users,  load average: 379.14, 379.17, 377.22
Tasks: 1153 total, 379 running, 774 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.9%us, 26.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
Mem:  70343000k total, 23768448k used, 46574552k free,   527376k buffers
Swap:        0k total,        0k used,        0k free, 10054596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1756 mysql     20   0 10.2g 1.8g 5256 S 19.8  2.7 904:41.13 mysqld
21515 apache    20   0  396m  18m 4512 R  2.1  0.0   0:34.42 httpd
21524 apache    20   0  396m  18m 4032 R  2.1  0.0   0:32.63 httpd
21544 apache    20   0  394m  16m 4084 R  2.1  0.0   0:36.38 httpd
21643 apache    20   0  396m  18m 4360 R  2.1  0.0   0:34.20 httpd
21817 apache    20   0  396m  17m 4064 R  2.1  0.0   0:38.22 httpd
22134 apache    20   0  395m  17m 4584 R  2.1  0.0   0:35.62 httpd
22211 apache    20   0  397m  18m 4104 R  2.1  0.0   0:29.91 httpd
22267 apache    20   0  396m  18m 4636 R  2.1  0.0   0:35.29 httpd
22334 apache    20   0  397m  18m 4096 R  2.1  0.0   0:34.86 httpd
22549 apache    20   0  395m  17m 4056 R  2.1  0.0   0:31.01 httpd
22612 apache    20   0  397m  19m 4152 R  2.1  0.0   0:34.34 httpd
22721 apache    20   0  396m  18m 4060 R  2.1  0.0   0:32.76 httpd
22932 apache    20   0  396m  17m 4020 R  2.1  0.0   0:37.34 httpd
22933 apache    20   0  396m  18m 4060 R  2.1  0.0   0:34.77 httpd
22949 apache    20   0  396m  18m 4060 R  2.1  0.0   0:34.61 httpd
22956 apache    20   0  402m  24m 4072 R  2.1  0.0   0:41.45 httpd
andrewtweber
  • You might want to add the screen output from top. – mdpc Jun 23 '11 at 23:41
  • Not really an answer, but it might help. Apache is great at lots of little requests, but if each page is really serving "15-20 medium-sized graphics" I'd strongly suggest offloading those to something like NGINX, or even a CDN like CloudFront. Are you using a small, medium or large EC2 instance? Having that many workers will use a LOT of memory and quite a bit of CPU, even when idle –  Jun 23 '11 at 23:43
  • @mdpc I updated with the full top output. – andrewtweber Jun 23 '11 at 23:45
  • @samarudge We're using the largest possible EC2 instance. Is there a way to decrease the # of workers while still supporting the same amount of traffic? – andrewtweber Jun 23 '11 at 23:45
  • Are you using pconnect for mysql? Also, I see the system is taking 25% of your CPU time - there's a chance that could be thread contention. At 100 million -pageviews-, your requests per second are pretty decent. Maybe look at increasing MaxRequestsPerChild to double that. Also, do you know if you're using prefork or worker? – thinice Jun 23 '11 at 23:50
  • @andrewtweber Unfortunately, Apache does that with that many workers. We generally use the rule that the number of Apache servers should be 64*number-of-CPU-cores (i.e. a 4-core server should have 256 servers). Have you installed the status mod so you can see exactly what all the workers are doing? –  Jun 23 '11 at 23:51
  • I would definitely enable keepalive. – thinice Jun 23 '11 at 23:52
  • @andrewtweber But like I say, if you split your servers into groups, some with Apache for dynamic pages and some with NGINX/LIGHTTPD for static content (images, JS, CSS etc.), not only will your site be faster for your end users, but you can drop the workers in Apache to a more reasonable level while still being able to serve the same requests –  Jun 23 '11 at 23:53
  • @andrewtweber You might also want to look at something like upstream VARNISH servers to cache your dynamic pages and further reduce load on your Apache nodes. When it comes down to it, Apache is a terrible webserver for large sites (compared to, say, NGINX). Unfortunately there's not really a good way to fully escape it if you're working with PHP (I'm guessing you are), so you just have to try and remove as much load as possible from Apache and offload it to other, better suited software –  Jun 23 '11 at 23:56
  • @andrewtweber I did some benchmarking on an EC2 `small` instance using siege; with Apache one instance could do about 30 requests/second, with NGINX I got it to 300 requests/second. Unfortunately NGINX only really works with static files (unless you're proxying to, say, a Ruby application server) –  Jun 23 '11 at 23:57
  • @rovangju Thanks for the advice! We're using prefork and regular connect (not pconnect). I've enabled KeepAlive and upped MaxRequestsPerChild to 10000. – andrewtweber Jun 24 '11 at 00:00
  • @samarudge Thank you as well. We have 8 cpu cores, so should we set ServerLimit to 512? We have the status mod enabled but I don't know how to check the worker status. I'll look into an NGINX server for some of the static content. – andrewtweber Jun 24 '11 at 00:12
  • You'd have to be the judge of how decreasing the server limit would affect your users/clients. Personally, going from 1400 to 512 might not help you since you've only got ~<400 concurrent running at any given time. You say you get 100mil -views- a month - do you have an idea of what your approximate requests-per-second are? – thinice Jun 24 '11 at 00:19
  • @rovangju I have no idea :/ But if we only have about 400 running, and limit the max to 512, that means that we'll only have about 100 sleeping processes, right? So the amount of CPU they eat will be decreased immensely, which I think is the main reason our server is loading pages so slowly. – andrewtweber Jun 24 '11 at 00:22
  • @rovangju Showing 0-7. I changed MaxClients to 512 as a test. That brought the idle up to 60% but made the page loading much slower, so I changed it back. I might not be able to reply again after this until tomorrow morning. Thanks again for all the help. – andrewtweber Jun 24 '11 at 00:37
  • When you get back, can you please post the top 20-30 lines of the apache-status page? – thinice Jun 24 '11 at 03:22

2 Answers


Looks to me like you're using the prefork mpm. Most of this answer assumes as much.

It appears Apache is creating a new process for every single connection. Is this normal?

For prefork? Yes.

Is there a way to more quickly kill the sleeping processes?

Are you sure these processes are not doing anything? With your MaxSpareServers setting you should only have up to 50 idle processes. Enabling mod_status and setting ExtendedStatus On will let you view the Apache scoreboard and see what those processes are actually doing.
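For reference, a minimal mod_status setup looks something like this (a sketch using Apache 2.2-style access control; the /server-status path and allowed address are placeholders to adjust for your environment):

ExtendedStatus On

<Location /server-status>
    SetHandler server-status
    # restrict the scoreboard to your own admin address(es)
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>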

Should we turn KeepAlive on? Each page loads about 15-20 medium-sized graphics and a lot of javascript/css.

Turning on KeepAlive is a good idea. It lets clients reuse a single connection for multiple requests, so you avoid paying connection set-up costs for every image, script, and stylesheet, and Apache processes are reused more efficiently.
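As a rough sketch, something like the following keeps connections open just long enough to serve a page's assets without letting idle clients pin a prefork process (the exact values are guesses to tune against your own measurements):

KeepAlive On
# keep the timeout short so idle connections free their process quickly
KeepAliveTimeout 2
MaxKeepAliveRequests 100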

As with most tuning: measure first to establish your baseline, then change one thing and re-measure to determine what effect the change had. Using (and graphing) mod_status output is handy for this.

You may be able to use the worker mpm, which tends to help with performance. However, some libraries (notably some PHP libraries) do not operate well with the worker mpm. YMMV

To determine which mpm you're using, run apache2 -V (or httpd -V, depending on the distro).
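For example (output abbreviated, and the exact wording varies by version):

$ httpd -V | grep -i mpm
Server MPM:     Prefork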

agy
  • Yes we are using prefork. Thanks for the reply! I've learned a lot about Apache just from your response. See my comment on mechcow's response about KeepAlive - it didn't seem to work with our particular solution, at least not in the short term. But once we set up separate DB and CDN servers, we'll be able to support more Apache processes and turn it on. – andrewtweber Jun 24 '11 at 15:22

There are entire books written on this topic, but to keep it simple:

Run the database in a separate tier

Your database workload and webserver workload are entirely different and will thrash resources in competing ways. It's best to keep them separate; this will also help you scale out in the future.

Isolate static and dynamic content

Consider running a faster webserver like nginx for static content and ditching Apache entirely. If you can, run nginx everywhere.
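A rough sketch of that split, with nginx serving static assets straight from disk and proxying everything else to Apache on another port (the paths, port, and extensions below are assumptions, not your actual layout):

server {
    listen 80;
    root /var/www/example;

    # serve static assets directly with long cache lifetimes
    location ~* \.(png|jpe?g|gif|ico|css|js)$ {
        expires 30d;
    }

    # hand everything else to Apache, e.g. listening on 127.0.0.1:8080
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}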

KeepAlive will definitely help

A lot of your resources are being burned by the set-up and tear-down of connections.

WebPageTest

For more good advice, I highly recommend http://www.webpagetest.org/. It will show you why the site is taking a long time to load and has a number of best-practice tips for fixing performance: enabling gzip compression, minifying JavaScript and CSS, and so on. Give it a read.
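For instance, gzip in Apache is one directive away with mod_deflate (assuming the module is loaded; the content types listed are illustrative):

# compress text-based responses before sending them to clients
AddOutputFilterByType DEFLATE text/html text/css text/plain application/javascript application/x-javascript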

hellomynameisjoel
  • Thanks a lot! Do you have any specific book(s) you'd recommend? – andrewtweber Jun 24 '11 at 13:50
  • When we enabled KeepAlive, it seemed to help page load times - *if* you could get a connection. Unfortunately nearly every block in the mod_status page was a "K", so 90% of the time my connection would time out waiting for one of the "K" blocks to free up. So for now we've left KeepAlive off and decreased MaxClients from 1400 to 500. That seems to be helping a lot, but I understand that it's a temporary solution. We're looking into using a separate db server and using a CDN with nginx. I'm accepting another answer because their rep is much lower, but I truly appreciate your response! – andrewtweber Jun 24 '11 at 15:19
  • Wow. We set up an RDS for our mysql server, and things are running 1000x faster! Thanks again :) – andrewtweber Jun 24 '11 at 21:29
  • Check out "Scalable Internet Architectures" by Theo Schlossnagle; it's a good book on the topic. Perhaps using Firebug would help you debug why KeepAlive hurt you so much - I guess it depends on your traffic workload. – hellomynameisjoel Jun 24 '11 at 22:05