The nginx monitoring script ztc intermittently fails to load the nginx status page, mostly when nginx (used here as a proxy) is under peak load of about 2000 rps. This raises "nginx is down" alerts in Zabbix, and a second later everything looks fine again.
[NginxStatus] 2015-12-16 20:24:55,289 - ERROR: failed to load test page
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/ztc/nginx/__init__.py", line 56, in _read_status
u = urllib2.urlopen(url, None, 1)
File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib64/python2.6/urllib2.py", line 391, in open
response = self._open(req, data)
File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
'_open', req)
File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
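The traceback shows that ztc calls urllib2.urlopen(url, None, 1), i.e. a hard 1-second timeout on the status fetch. As a stopgap while hunting the root cause, the check can be wrapped in a retry helper so that one slow response under load does not fire an alert. A minimal sketch (Python 3 urllib for illustration; the function name and parameters are my own, not part of ztc):

```python
import time
import urllib.request  # ztc itself uses the Python 2 urllib2 equivalent


def fetch_status(url, timeout=3.0, retries=2, backoff=0.5,
                 opener=urllib.request.urlopen):
    """Fetch the nginx status page, retrying on timeouts.

    A single 1-second timeout (as in ztc's _read_status) produces false
    "nginx is down" alerts under load; a longer timeout plus a retry
    smooths over transient accept-queue delays.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return opener(url, None, timeout).read()
        except OSError as err:  # URLError is a subclass of OSError in Python 3
            last_err = err
            time.sleep(backoff * (attempt + 1))
    raise last_err
```

The `opener` parameter is injected only so the helper can be exercised without a live nginx; in production the default urlopen is used.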
Since it happens only under peak load (about 2000 rps), I suspect some kernel parameter is the cause.
Here's my nginx configuration:
user nginx;
worker_processes 4;
timer_resolution 100ms;
worker_priority -15;
worker_rlimit_nofile 200000;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    server_tokens off;
    access_log /var/log/nginx/access.log;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    # keepalive_requests 120;
    # keepalive_timeout 65;
    gzip on;
    gzip_http_version 1.0;
    gzip_comp_level 2;
    gzip_proxied any;
    gzip_vary off;
    gzip_types text/plain text/css application/x-javascript text/xml application/xml application/rss+xml application/atom+xml text/javascript application/javascript application/json text/mathml;
    gzip_min_length 1000;
    gzip_disable "MSIE [1-6]\.";
    variables_hash_max_size 1024;
    variables_hash_bucket_size 64;
    server_names_hash_bucket_size 64;
    types_hash_max_size 2048;
    types_hash_bucket_size 64;
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
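One detail worth checking alongside this config: on Linux, nginx uses a default listen backlog of 511 unless the listen directive says otherwise, so raising net.core.somaxconn alone does not enlarge nginx's accept queue — and a full accept queue would show up exactly as the SYN_RECV buildup and intermittent timeouts described above. If that turns out to be the bottleneck, the backlog has to be raised on the listen line in the relevant server block (illustrative values, capped by somaxconn):

```nginx
server {
    # backlog must not exceed net.core.somaxconn (15000 in the sysctl.conf below)
    listen 80 backlog=15000;
    # ... rest of the server block unchanged ...
}
```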
Here's my sysctl.conf:
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.netfilter.nf_conntrack_max=1048576
net.nf_conntrack_max=1048576
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_tw_reuse=1
net.core.somaxconn=15000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_max_tw_buckets=720000
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_fin_timeout=30
And the netstat output:
netstat -an | grep -e :80 -e :443 |awk '/^tcp/ {A[$(NF)]++} END {for (I in A) {printf "%5d %s\n", A[I], I}}'
18525 TIME_WAIT
1 CLOSE_WAIT
499 FIN_WAIT1
1544 FIN_WAIT2
33311 ESTABLISHED
563 SYN_RECV
7 CLOSING
294 LAST_ACK
3 LISTEN
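For reference, the awk one-liner above simply histograms the last column of lines starting with "tcp". The same tally can be reproduced in a few lines of Python when awk isn't at hand (a sketch with an inline sample; in real use, feed it the output of netstat -an):

```python
from collections import Counter


def tcp_state_counts(netstat_output):
    """Count TCP socket states, mirroring:
    netstat -an | awk '/^tcp/ {A[$(NF)]++} END {...}'
    """
    states = (line.split()[-1]
              for line in netstat_output.splitlines()
              if line.startswith("tcp"))
    return Counter(states)


# Sample netstat lines for illustration (addresses are made up)
sample = """tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.0.0.1:80 10.0.0.2:51000 TIME_WAIT
tcp 0 0 10.0.0.1:80 10.0.0.3:51001 ESTABLISHED
tcp 0 0 10.0.0.1:80 10.0.0.4:51002 ESTABLISHED"""

counts = tcp_state_counts(sample)
```

Each key of the resulting Counter is a TCP state and each value its socket count, matching the two-column listing above.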
What could be the root cause of this? Are these netstat figures abnormal for 2000 rps? Is there a mistake in my sysctl.conf that could be causing the problem?