Recently we've been experiencing issues with our Varnish (3x) -> Apache (3x) setup, resulting in a huge spike in SYN_SENT
connections.
The spike itself is due to the amount of new traffic hitting the site (not a DDOS of any kind), and it seems like our Varnish machines are having problems forwarding traffic to the backend servers (drop on Apache traffic correlates to spikes on the varnishes), congesting the available ports pool with SYN_SENT
.
Keep-alives are enabled on Apache (15s).
Which side is the fault on? The amount of traffic is significant, but by no amount should it cause such a setup (3x Varnish frontend machines, 3x backend Apache servers) to stall.
Please help.
Munin screenshot for connections through firewall is here.
Varnish
~$ netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
9 CLOSE_WAIT
12 CLOSING
718 ESTABLISHED
39 FIN_WAIT1
1714 FIN_WAIT2
76 LAST_ACK
12 LISTEN
256 SYN_RECV
6124 TIME_WAIT
/etc/sysctl.conf (Varnish)
net.ipv4.netfilter.ip_conntrack_max = 262144
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.ip_local_port_range = 1024 65536
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_fin_timeout = 30
Apache
netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
11 CLOSE_WAIT
286 ESTABLISHED
38 FIN_WAIT2
14 LISTEN
7220 TIME_WAIT
/etc/sysctl.conf (Apache)
vm.swappiness=10
net.core.wmem_max = 524288
net.core.wmem_default = 262144
net.core.rmem_default = 262144
net.core.rmem_max = 524288
net.ipv4.tcp_rmem = 4096 262144 524288
net.ipv4.tcp_wmem = 4096 262144 524288
net.ipv4.tcp_mem = 4096 262144 524288
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_keepalive_time = 30
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.core.somaxconn = 2048
net.ipv4.conf.lo.arp_ignore=8
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
vm.swappiness = 0
kernel.sysrq=1
kernel.panic = 30