
While reloading nginx, I started getting "possible SYN flooding on port 443" errors in the messages log, and nginx seems to become completely unresponsive at that time (for quite a while), because Zabbix reports "nginx is down" with a ping of 0 s. RPS at that time is about 1800.

But the server stays responsive on the other, non-web ports (SSH, etc.).

Where should I look, and which configs (sysctl, nginx) should I post, to find the root cause of this?

Thanks in advance.
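Before anything else, it may be worth confirming that the message corresponds to the listen queue overflowing (with syncookies disabled, excess SYNs are simply dropped). A diagnostic sketch, assuming a Linux box with net-tools and iproute2 installed:

```shell
# Cumulative counters for overflowed accept queues and dropped SYNs;
# if these grow during a reload, the listen backlog is the bottleneck.
netstat -s | grep -Ei 'overflow|SYNs to LISTEN'

# For listening sockets, ss shows Recv-Q = current accept-queue length
# and Send-Q = the configured backlog.
ss -ltn 'sport = :443'
```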

UPD:

Some additional info:

$ netstat -tpn |awk '/nginx/{print $6,$7}' |sort |uniq -c
   3266 ESTABLISHED 31253/nginx
   3289 ESTABLISHED 31254/nginx
   3265 ESTABLISHED 31255/nginx
   3186 ESTABLISHED 31256/nginx

nginx.conf sample:

worker_processes  4;
timer_resolution 100ms;
worker_priority -15;
worker_rlimit_nofile 200000;

events {
  worker_connections  65536;
  multi_accept on;
  use epoll;
}

http {

  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;

  keepalive_requests 100;
  keepalive_timeout  65;

}

custom sysctl.conf:

net.ipv4.ip_local_port_range=1024 65535
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.send_redirects=0
net.core.netdev_max_backlog=10000
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.netfilter.nf_conntrack_max=1048576
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_max_tw_buckets=1400000
net.core.somaxconn=250000
net.ipv4.tcp_keepalive_time=900
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_fin_timeout=10
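Worth noting: with net.ipv4.tcp_syncookies=0, SYNs that overflow the queue are dropped rather than answered with cookies, which is consistent with the "possible SYN flooding" message. A quick cross-check of the effective limits (a sketch; the backlog passed to listen() is silently capped by net.core.somaxconn):

```shell
# Compare these against the backlog nginx asks for on its listen sockets.
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
cat /proc/sys/net/ipv4/tcp_syncookies
```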

UPD2

Under normal load of about 1800 RPS, when I set the nginx backlog to 10000 on ports 80 and 443 and then reloaded nginx, it started using more RAM (3.8 GB of my 4 GB instance were used, and some workers were killed by the OOM killer), and with worker_priority at -15 the load was over 6 (while my instance has only 4 cores). So the instance was quite laggy; I set worker_priority to -5 and the backlog to 1000 for every port. Now it uses less memory, and peak load was 3.8, but nginx still becomes unresponsive for a minute or two after a reload. So the problem persists.
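For reference, the backlog experiment described above corresponds to listen directives like these (a sketch; the effective value is additionally capped by net.core.somaxconn):

```nginx
server {
    # backlog values from the experiment above (hypothetical server block)
    listen 80 backlog=1000;
    listen 443 ssl backlog=1000;
}
```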

Some netstat details:

netstat -tpn |awk '/:80/||/:443/{print $6}' |sort |uniq -c
      6 CLOSE_WAIT
     14 CLOSING
  17192 ESTABLISHED
    350 FIN_WAIT1
   1040 FIN_WAIT2
    216 LAST_ACK
    338 SYN_RECV
  52541 TIME_WAIT
d.ansimov

1 Answer


If you have:

  keepalive_timeout  65;

I can imagine it taking a while for those connections to be terminated and the workers restarted. Without looking at the code, I'm not quite sure whether nginx waits for them to expire once it gets a reload.

You could try lowering the value and see if it helps.
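For example (the exact value is just a starting point to experiment with):

```nginx
http {
    # shorter keepalive so idle connections drain quickly after a reload
    keepalive_timeout 10;
}
```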

evilBunny
  • Setting `keepalive_timeout` to `10` and even to `0` (this turns keepalive connections off) didn't solve the problem, but thank you, mate. – d.ansimov Jun 21 '16 at 13:08
  • hmm, looking at it again... it could be that you are running out of ports on the interface and they are stuck in TIME_WAIT... sysctl net.ipv4.tcp_tw_recycle=1 could relieve the pressure – evilBunny Jun 21 '16 at 13:35
  • I've read a lot about tcp_tw_recycle, and it seems it's not desirable to use it except as a last resort (though it does seem to be helpful). Traffic on the interface isn't high, so there must be another way to solve this issue. But thanks, I'll give it a try. – d.ansimov Jun 21 '16 at 14:06
  • I found the main reason I avoided using it: it [won't handle](https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html) connections from NAT'ed machines. – d.ansimov Jun 21 '16 at 14:19
  • How many source IPs are the requests from? That TIME_WAIT count indicates that keepalive isn't really working, or is it 50k different IPs? – evilBunny Jun 21 '16 at 15:03
  • It's really hard to count, but now under lower load it's about 35000 unique IPs. – d.ansimov Jun 21 '16 at 19:41
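For reference, one way to count them directly (a sketch built on the same netstat output as above; IPv4 addresses assumed):

```shell
# Count unique remote IPs among TIME_WAIT sockets towards :80/:443.
# Field 4 is the local address, field 5 the foreign address, field 6 the state.
netstat -tn \
  | awk '$6=="TIME_WAIT" && ($4 ~ /:80$/ || $4 ~ /:443$/) {split($5,a,":"); print a[1]}' \
  | sort -u | wc -l
```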