I'm running Apache/2.4.7 on m4.large instances in AWS behind ELBs. The overwhelming majority of my traffic is simple one-off requests from naive clients, not web pages. The configuration is old and time-tested, but two weeks ago I migrated the deployment into a VPC with no substantive changes to the Apache configuration, the ELB configuration, or the application itself.

Since then, I have been experiencing a few problems at the Apache and ELB level.

1) Very cyclical spikes in latency, sometimes severe enough that the ELB removes 'unhealthy' hosts. The ELB latency graphs usually sit around 20 ms but spike to 5 or more seconds for periods of a minute or two. Surge queue length and 504s both show up during the spikes.

2) There's a monitoring process on each server that requests the server-status page every minute (it's only exposed to localhost); more or less every fifteen minutes, that simple request for the server-status page times out. The check uses a 0.5-second read timeout, which seems like more than enough for such a simple request; a sketch of the check follows this list.

3) The scoreboard fills up with keep-alives and writes very rapidly. An example scoreboard from around the time of one of the timeouts described in (2):

400 requests currently being processed, 0 idle workers

KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKWKKKKKKKKKKK
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
KKKKKKKKKKWKKKKKKKKKKKKKKKKKKKKKKKKKWKKKKKKKKKKKKKKKKKKKKKKKKKKK
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
KKKKKKKK

4) The Tomcats also experience spikes in RUNNABLE threads, which propagate all the way through to the database layer.
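
Here is roughly what the monitor in (2) does. This is a minimal sketch, assuming mod_status is exposed at /server-status on localhost; the real monitor is internal tooling and the names here are illustrative:

import urllib.request

def check_server_status(timeout=0.5):
    # Fetch the machine-readable status page; urlopen raises on
    # HTTP errors and on timeout, which is what trips the monitor.
    url = "http://localhost/server-status?auto"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(check_server_status())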

Although it has the least impact of the four, (2) is perhaps the most concerning to me: it seems crazy that Apache can get into a state where it can't even serve a request for its own status page, one that never enters the application proper.

Through all of this, the hardware is relatively unstressed; CPU on the web servers rarely exceeds 50%. That suggests a configuration problem related to threading, but the configuration has stayed largely constant, which leaves me at a loss.

Relevant Configuration:

ELBs (where SSL terminates) have an idle timeout of 60 seconds.
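
For reference, the idle timeout can be read back with boto3 (a sketch; "my-elb" is a placeholder for the real load balancer name):

import boto3

elb = boto3.client("elb")  # classic ELB API
attrs = elb.describe_load_balancer_attributes(LoadBalancerName="my-elb")
print(attrs["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"])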

Each Apache is configured more or less as follows:

# All timeouts are in seconds; note that KeepAliveTimeout matches
# the 60-second ELB idle timeout.
Timeout 600
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 60

I use the mpm_worker module with:

# MaxRequestWorkers 400 / ThreadsPerChild 50 allows up to 8 child
# processes, i.e. the 400 fully busy slots in the scoreboard above.
StartServers              2
MinSpareThreads         100
MaxSpareThreads         300
ThreadLimit              64
ThreadsPerChild          50
MaxRequestWorkers       400
MaxConnectionsPerChild    0

The Tomcat servlets are reached via AJP forwarding; each Tomcat listens with 300 threads, complemented by a maximum of 300 database connections.
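
The proxying itself is unremarkable; it looks roughly like this, assuming mod_proxy_ajp on the default AJP port 8009 ("/app" is a placeholder path, the real mappings are app-specific):

# Placeholder mapping; the real paths are app-specific.
ProxyPass /app ajp://localhost:8009/app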

What about this setup is driving my problems?
