
This is a busy web server running Apache 2.4. Recently we have noticed several brief outages where the server times out on incoming requests for a short period (less than a minute). Looking at the Apache error log, there are a number of instances of this error correlated with each outage:

[Thu Dec 19 12:37:21.104416 2019] [http2:info] [pid 10827:tid 46937350948608] [client <ip address>:62355] h2_stream(209-285,HALF_CLOSED_REMOTE): redo, added to q

Where <ip address> is always the same address, not one within our organization. Specifically, we get this burst of errors in the log immediately after a series of request timeouts.

So, somehow this IP is hitting our server with a burst of requests that trigger this error, and this somehow prevents the server from responding to other, unrelated requests from other IPs. That includes curl requests made from the server to itself, whose failures were how we noticed this happening. (These are API calls between our separate websites, which currently run on the same server.)

Most recently, for example, we had a number of API requests time out between 20:36:35 and 20:37:16 UTC. (These are on a 10s timeout since they normally return near instantly, so the first of the timed-out requests started at 20:36:25.) Then we got 2 copies of the above error in the error log at 20:37:06, and then 76 copies with timestamps ranging from 20:37:16 to 20:37:29.
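
For context, the failing self-calls look roughly like this (the URL is hypothetical; --max-time enforces the 10-second cap mentioned above):

    # sketch of the kind of server-to-self API call described above
    # (hypothetical URL; --max-time 10 matches our 10s timeout)
    curl -sS --max-time 10 https://other-site.example.com/api/endpoint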

76 requests isn't nearly enough to exhaust any resource I'm aware of, such as Apache threads or TCP connections. And the error obviously relates to http2 somehow. Maybe some kind of SSL negotiation resource that is normally held very briefly, so a small number of instances can normally handle our traffic level? But this is clearly exposing a vulnerability in our setup if a single IP can effectively DoS our server with so little traffic.
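
One quick way to sanity-check the thread theory during an incident is mod_status's machine-readable output (this assumes mod_status is enabled and /server-status is reachable from localhost):

    # count busy vs. idle Apache workers during an incident
    # (assumes mod_status is enabled and allowed from localhost)
    curl -s http://localhost/server-status?auto | grep -E 'BusyWorkers|IdleWorkers'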

So my questions are: what does this error mean, is there a configuration setting I can increase to mitigate what appears to be a relatively low limit, and how can we prevent an IP from effectively DoS'ing our server with this type of request?
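
For reference, mod_http2 does expose a few per-connection limits along these lines (a sketch with illustrative values only, not recommendations; the right numbers depend on the MPM configuration):

    # mod_http2 limits one might inspect (illustrative values)
    H2MaxSessionStreams 100   # max concurrent streams per HTTP/2 connection
    H2MinWorkers 16           # lower bound on h2 worker threads
    H2MaxWorkers 96           # upper bound on h2 worker threads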


1 Answer


It turns out this was due to PHP-FPM's dynamic process creation. During quick spikes of traffic that required additional PHP-FPM processes to be started, the server was getting bogged down spawning the new processes and dropping requests. Changing the PHP-FPM pool configuration from dynamic to static solved the problem (at the cost of somewhat higher idle memory usage, as each idle PHP-FPM process uses about 500 kB). Alternatively, we could have solved the problem by significantly increasing pm.min_spare_servers, but decided that wasn't worth it for the small amount of memory saved.
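
The change was along these lines in the pool configuration (a sketch; the path and pm.max_children value are illustrative and vary by distro and available memory):

    ; e.g. /etc/php/7.3/fpm/pool.d/www.conf (path varies by distro/version)
    ; before: workers spawned on demand
    ;pm = dynamic
    ;pm.min_spare_servers = 5

    ; after: a fixed pool of workers, always running
    pm = static
    pm.max_children = 50   ; illustrative; size to available memory

The alternative mentioned above would have been keeping pm = dynamic and raising pm.min_spare_servers so that enough idle workers are always on hand to absorb a spike.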

  • May have spoken too soon. We're running into the same issue again, although it takes a larger burst to trigger it. So switching PHP-FPM to static definitely helped, but there's still something causing this error during large request spikes. – Nathan Stretch Feb 05 '20 at 07:30
  • Looks like it is caused by legitimate timeouts on the PHP side now. All the HALF_CLOSED_REMOTE errors are now preceded by FCGI timeout errors. So, similar to the original issue: when the connection to PHP-FPM fails in whatever way, be it not being able to spawn processes fast enough or timing out due to processing within PHP, it appears to have the potential to cause this Apache error when using http2. – Nathan Stretch Feb 05 '20 at 09:06
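
If the FCGI timeouts themselves need tuning, the relevant knobs live on both sides of the FastCGI connection; a hedged sketch with illustrative values (the socket path and numbers are assumptions):

    # Apache side (mod_proxy_fcgi): route PHP to the FPM socket,
    # with mod_proxy's network timeout raised (illustrative values)
    <FilesMatch "\.php$">
        SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost/"
    </FilesMatch>
    ProxyTimeout 60

On the PHP-FPM side, the pool's request_terminate_timeout caps how long a single request may run before the worker is killed.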