
We have a legacy cluster of servers running Apache 2.4 that serves our application from behind an ELB. This ELB has two listeners, one HTTP and one HTTPS; TLS terminates at the ELB, which sends plain HTTP traffic to the instances behind it. This ELB also has pre-open turned off (it was causing a busy-worker buildup). Under normal load we have 1-3 busy workers per instance.

We have a new cluster of servers, behind a new ELB, that we are trying to migrate to. The purpose of this migration is to allow for SNI – serving TLS traffic for thousands of domains. To that end this cluster uses mod_proxy_protocol, with the PROXY protocol enabled at the ELB level. For testing purposes we've been weighting traffic at the DNS (Route 53) level to send 30% of our traffic to the new load balancer. Even under this small load we see 5-10 busy workers, and that number grows with traffic.
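For context, enabling the PROXY protocol on a Classic ELB is done through a back-end policy; a rough sketch with the AWS CLI (the load balancer name and instance port here are placeholders, not our actual values):

```
# Create a ProxyProtocol policy and attach it to the ELB's back-end port
aws elb create-load-balancer-policy \
    --load-balancer-name new-elb \
    --policy-name EnableProxyProtocol \
    --policy-type-name ProxyProtocolPolicyType \
    --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true

aws elb set-load-balancer-policies-for-backend-server \
    --load-balancer-name new-elb \
    --instance-port 80 \
    --policy-names EnableProxyProtocol
```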

As a further test we took one of these new instances, disabled the PROXY protocol, and moved it from the new ELB to the old ELB; the worker count dropped back to average levels, 1-3 busy workers. This seems to indicate that there is an issue either with the ELB (differences between HTTP and TCP handling?) or with mod_proxy_protocol.

My question: why do we have twice as many busy Apache workers when using the PROXY protocol and the new ELB? I would think that since TCP listeners are dumb and don't do any processing on the traffic, they would be faster and as a result consume less worker time than HTTP listeners, which actively 'modify' the traffic going through them.

Any guidance to help us diagnose this issue is appreciated.

Andrew

1 Answer


The difference is simple and significant:

An ELB in HTTP mode takes care of holding the idle keep-alive connections from browsers without holding open corresponding connections to the instance. There's no necessary correlation between browser connections and back-end connections -- a back-end connection can be reused to serve requests arriving on many different browser connections.

In TCP mode, it's 1:1. It has to be, because the ELB can't reuse a back-end connection for a different browser connection on the front end -- it's not interpreting what's going down the pipe. That's always true for TCP, but if the reason isn't intuitive, it should be particularly obvious with the proxy protocol enabled. The PROXY "header" is not in fact a "header" in the usual sense -- it's a preamble. It can only be sent at the very beginning of a connection, identifying the source address and port. The connection persists until the browser or server closes it, or it times out. It's 1:1.
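For illustration, here's roughly what a version-1 PROXY preamble looks like on the wire, prepended to the first bytes of the connection (the addresses and ports are made-up values):

```
PROXY TCP4 198.51.100.22 10.0.1.17 35646 80\r\n
GET / HTTP/1.1
Host: example.com
```

Everything after that single `\r\n`-terminated line is the untouched byte stream, which is why the preamble can only describe the one client whose connection it precedes.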

This is not likely to be viable with Apache, where each of those 1:1 connections ties up a worker for as long as it stays open.

Back to HTTP mode, for a minute.

This ELB also has pre-open turned off (it was causing a busy worker buildup).

I don't know how you did that -- I've never seen it documented, so I assume this must have been through a support request.

This seems like a case of solving entirely the wrong problem. Instead of having a number of connections that seems to you to be artificially high, all you've really accomplished is keeping the number of connections artificially low -- and ultimately, you're probably impairing your performance and ability to scale. Those spare connections exist to handle bursts of demand. If your instance is too small to handle them, then I would suggest that the real problem is just that: your instance is too small.

Another approach -- which is exactly the solution I use for my dreaded legacy Apache-based applications (one of which has a single Apache server sitting behind a total of about 15 to 20 different ELBs -- necessary because each ELB is offloading SSL using a certificate provided by one of the old platform's customers) -- is HAProxy between the ELBs and Apache. HAProxy can handle literally hundreds of connections and millions of requests per day on tiny instances (I'm talking tiny -- t2.nano and t2.micro), and it has no problem keeping the connections alive from all of the ELBs yet closing the Apache connection after each request... so it's optimizing things in both directions.
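A minimal sketch of that pattern, assuming HAProxy listens on port 8080 between the ELBs and a local Apache on port 80 (all names, ports, and timeouts here are placeholders, not my actual config):

```
# haproxy.cfg -- keep the ELB-side connections alive,
# but close the Apache-side connection after each request
defaults
    mode http
    timeout connect      5s
    timeout client       60s
    timeout server       60s
    timeout http-request 60s

frontend from_elbs
    bind :8080                  # the ELBs forward plain HTTP here
    option http-server-close    # keep-alive toward the ELBs, close toward Apache
    default_backend apache

backend apache
    server local 127.0.0.1:80
```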

And of course, you can also use HAProxy with a TCP balancer and the proxy protocol -- the author of HAProxy was also the creator of the proxy protocol standard. You can also just run it on the instances with Apache rather than on separate instances. It's lightweight in memory and CPU and doesn't fork. I'm not affiliated with the project, other than having submitted occasional bug reports during the development of the Lua integration.
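In that setup, the only change on the HAProxy side is telling the listening socket to expect the ELB's preamble; a hypothetical frontend (the port is a placeholder):

```
frontend from_tcp_elb
    bind :8443 accept-proxy    # parse and strip the PROXY preamble from the ELB
    default_backend apache
```

With `accept-proxy`, HAProxy consumes the preamble itself, so the client address it sees is the real browser address, which it can then pass along to Apache (e.g. with `option forwardfor`).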

Michael - sqlbot
  • Thank you for the insight. When laid out like that it makes perfect sense that the browser is holding the worker open and that the ELB would have mitigated that for us when doing HTTP routing. We have ~18 servers behind the ELB and turned keep-alive off believing that the chances of any worker's connection getting reused at the instance level were slim. We'll look at HAProxy as suggested and we'll also re-evaluate the way we treat connection reuse – if you have any advice we'd be eager to hear it. – Andrew Feb 01 '17 at 19:22
  • Additionally: the pre-open worker buildup issue was handled with a support ticket. We were never able to diagnose the issue because we didn't have the time to put into it, given that turning pre-open on/off takes a support ticket. It happened when we migrated to VPC from EC2-Classic (and at the same time from Apache 2.2 to 2.4). The VPC ELB caused workers to build up in the R "reading request" state, and they rarely closed. Eventually all our workers would be in an R state and Apache would start refusing further traffic regardless of how high we set MaxRequestWorkers. Neither AWS support nor Googling turned up anything. – Andrew Feb 01 '17 at 19:24
  • That's interesting about Apache getting stuck indefinitely in "reading request." Apache should have closed the connections, or the ELB should eventually have closed them (maybe after the timeout). My standard config for HAProxy machines uses `timeout http-request 60000` (60 seconds), after which the proxy closes these extra connections. I have been using HAProxy for a while now, and at this point I don't actually recall how I arrived at this value or whether I've experimented to see what happens if it is set higher on my side. I may need to try that out of curiosity. – Michael - sqlbot Feb 01 '17 at 22:24
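For reference, a minimal sketch of where that directive sits in an HAProxy config (the surrounding timeout values are placeholders, not Michael's actual settings):

```
defaults
    mode http
    timeout http-request 60000   # ms; give up on connections whose request never completes
    timeout client       60000
    timeout server       60000
```

`timeout http-request` bounds how long HAProxy will wait for a complete HTTP request -- exactly the state the stuck "reading request" Apache workers were in.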