
We are seeing a very high number of connections entering the TIME_WAIT state on our proxy server during a window of bursty traffic - tens of thousands of connections per minute for about 15 minutes.

The proxy server sits behind a firewall that provides NAT. Upstream of the firewall is the destination service, which has TCP timestamps enabled as well as the tw_reuse and tw_recycle TCP settings.

The destination service likely sees a single source IP because of the NAT on the firewall. The firewall is dropping connections that arrive out of state, and our proxy is producing a lot of 503 errors. The proxy has about 16,000 ephemeral ports available.
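
For reference, on a Linux proxy the TIME_WAIT count, the ephemeral port range, and the current TIME_WAIT-related settings can be checked with something like:

```
# Count sockets currently in TIME_WAIT (iproute2's ss)
ss -tan state time-wait | wc -l

# Ephemeral port range available for outgoing connections
sysctl net.ipv4.ip_local_port_range

# Current TIME_WAIT-related settings
sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle net.ipv4.tcp_fin_timeout
```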

Can anyone help explain this behavior? Based on the information above, are there settings we need to enable on the proxy to prevent it?

Freddie
  • Is it a reverse proxy or a forward proxy? What are the values for the tw_reuse and tw_recycle options? Is the high activity expected (marketing campaign) or not (DDoS)? – Mindaugas Bernatavičius Jan 22 '18 at 20:59
  • It's a forward proxy. The destination service has tw_reuse and tw_recycle set to 1. The proxy has these options set to 0. The high activity is expected: it is for live streaming events where users basically join in bursts over a short period of time. – Freddie Jan 23 '18 at 18:40
  • If the options are turned OFF on the proxy, TIME_WAIT connections are neither recycled nor reused. However, enabling them is not safe if you have clients behind a NAT router (meaning many people behind one public IP). What would be relatively safer is shortening `net.ipv4.tcp_fin_timeout`; this would shorten the time each connection stays in TIME_WAIT before its port returns to the limited pool of available connections. Questions: do you know that the proxy is not overloaded and is capable of handling so many connections? What are the CPU and RAM usage at that time? Are there any logs that the proxy service generates? – Mindaugas Bernatavičius Jan 23 '18 at 20:26
  • The proxy was at 90% CPU with 25K connections in TIME_WAIT at its peak. So are you saying that the reason so many connections are getting stuck in TIME_WAIT is that the destination service has reuse and recycle enabled but the proxy does not? Why would enabling these be unsafe for a NAT'd IP? We currently have 2*MSL configured to 30 seconds. The proxy may be undersized, but I'm not sure yet because of all the connections in TIME_WAIT. – Freddie Jan 24 '18 at 02:02
  • Not really. I'm saying that reuse and recycle help in these cases (when connections are short); however, you need to be sure that it was not the proxy software going down first and causing the excessive TIME_WAIT sockets, because fixing the TCP layer will not change much if the upper layer can't handle the load. You can read about the problem with NAT'd IPs here: https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux - it is not a well-known issue; I have had security engineers enable tw_reuse and tw_recycle without even considering that it is unsafe. In summary: the TCP stack will start sending garbage to NAT'ed IPs. – Mindaugas Bernatavičius Jan 24 '18 at 08:56
  • Unfortunately it is hard to advise anything at this point, since there is not nearly enough information. If you think it's a port-exhaustion issue, try increasing the range of usable ports. If you can tune 2*MSL, make it 20 seconds. If you can add more CPU, that can help. If you can scale horizontally (add a second proxy), even better. One more question to answer about the problem: are the TIME_WAIT connections client --> proxy or proxy --> backend (the firewall, in your case)? (See the sketches after this comment thread.) – Mindaugas Bernatavičius Jan 24 '18 at 08:59
  • I will read that link you sent. Hard to tell, but I'm pretty sure the TIME_WAIT connections are from the proxy to the firewall. The setup is: client -> firewall -> proxy -> firewall -> destination service. The firewall closest to the destination service is the one providing the NAT'd IP. – Freddie Jan 24 '18 at 13:06
  • I should add that the firewall closest to the destination service is getting a lot of "out-of-state" packets and subsequently dropping the connections. I believe the proxy is then seeing this as a 503 error. The firewall CPU was also high during peak - close to 99%. – Freddie Jan 24 '18 at 13:10
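
To answer which leg the TIME_WAIT sockets are on, a quick check with ss is possible on a Linux proxy; the firewall address (192.0.2.10) and the proxy's client-facing listen port (:8080) below are placeholders, not values from the question:

```
# TIME_WAIT sockets on the proxy -> firewall/backend leg
ss -tan state time-wait dst 192.0.2.10 | wc -l

# TIME_WAIT sockets on the client -> proxy leg (proxy listening on 8080 here)
ss -tan state time-wait '( sport = :8080 )' | wc -l
```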
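
And a minimal sketch of the sysctl adjustments discussed above, again assuming a Linux proxy; the values are illustrative, not recommendations:

```
# /etc/sysctl.d/99-timewait.conf -- illustrative values only

# Widen the ephemeral port range beyond the ~16,000 ports mentioned in the question.
net.ipv4.ip_local_port_range = 15000 64999

# Reuse TIME_WAIT sockets for new outgoing connections (proxy -> firewall);
# this relies on TCP timestamps being enabled on both ends.
net.ipv4.tcp_tw_reuse = 1

# Leave tcp_tw_recycle off: per the linked article it breaks clients behind NAT
# (and it was removed entirely in Linux 4.12).
net.ipv4.tcp_tw_recycle = 0
```

The file would be loaded with `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-timewait.conf`).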

0 Answers