I am monitoring a host with Zabbix, and I noticed that the Zabbix agent has started using a lot of CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26774 zabbix 20 0 68428 1312 752 R 99 0.0 63:27.67 /usr/sbin/zabbix_agentd
26773 zabbix 20 0 68428 1324 764 R 99 0.0 63:26.33 /usr/sbin/zabbix_agentd
There are about 100 items monitored with the agent. The same items are monitored on other identical hosts, where the Zabbix agent does not consume nearly as much CPU. The agents send collected data to a Zabbix proxy. The agent configuration is the default. The host CPU has 8 cores (2.4 GHz). The shortest update interval among the monitored items is 60 seconds.
I use Zabbix server / agent 1.8.11 and I cannot upgrade to 2.2, at least for now.
I checked the debug logs on all sides (Zabbix server, proxy, and agent) and could not find any problems there, just the usual checks being received and sent all the time.
I don't know how to investigate this further, so I am asking the community for help. How can I trace why the agent is consuming so much CPU?
One more thing that looks strange to me is the statistics of the network connections:
netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
2 CLOSE_WAIT
21 CLOSING
3521 ESTABLISHED
2615 FIN_WAIT1
671 FIN_WAIT2
1542 LAST_ACK
14 LISTEN
256 SYN_RECV
117841 TIME_WAIT
Thank you.
Update 1.
netstat -tnp|grep zabbix
tcp 1 0 10.120.0.3:10050 10.128.0.15:53372 CLOSE_WAIT 23777/zabbix_agentd
tcp 1 0 10.120.0.3:10050 10.128.0.15:53970 CLOSE_WAIT 23775/zabbix_agentd
tcp 1 0 10.120.0.3:10050 10.128.0.15:53111 CLOSE_WAIT 23776/zabbix_agentd
10.128.0.15 is the IP of the Zabbix server and 10.120.0.3 is the IP of the monitored host.
Update 2.
Those TIME_WAIT connections belong to the nginx web server.
Update 3.
I attached to the Zabbix agent process with strace, and it turned out that 100% of the time is spent by the agent in the read syscall:
strace -C -f -p 23776
Process 23776 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 2.175528 2515 865 read
------ ----------- ----------- --------- --------- ----------------
100.00 2.175528 865 total
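When the summary is longer than one row, it can help to rank the syscalls programmatically. The following is only a rough sketch in Python, assuming the column layout printed by strace above; the function name parse_strace_summary and the dictionary keys are made up for illustration.

```python
def parse_strace_summary(text):
    """Parse an strace -c/-C summary table into rows sorted by time spent."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the "errors" column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            pct, seconds = float(fields[0]), float(fields[1])
            usecs, calls = int(fields[2]), int(fields[3])
        except ValueError:
            continue  # header or separator line
        errors = int(fields[4]) if len(fields) == 6 else 0
        rows.append({
            "syscall": fields[-1],
            "pct_time": pct,
            "seconds": seconds,
            "usecs_per_call": usecs,
            "calls": calls,
            "errors": errors,
        })
    return sorted(rows, key=lambda r: r["seconds"], reverse=True)


# The summary from the strace run above, pasted verbatim.
SUMMARY = """\
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 2.175528 2515 865 read
------ ----------- ----------- --------- --------- ----------------
100.00 2.175528 865 total
"""

if __name__ == "__main__":
    for row in parse_strace_summary(SUMMARY):
        print(row["syscall"], row["calls"], row["seconds"])
```

For the run above this leaves a single row: 865 read calls at roughly 2.5 ms each, which is very slow for a read and hints that each call makes the kernel do a lot of work.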
Update 4.
Just to make everything clear: I tried to deal with the connections in the TIME_WAIT state. For example, I decreased net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait and net.netfilter.nf_conntrack_tcp_timeout_time_wait to see whether it would help. Unfortunately, it did not.
Conclusion
The Zabbix agent CPU load turned out to be tied to the number of network connections on the host. Attaching to the zabbix_agentd process with strace shows where the CPU time goes (the first column is the percentage of kernel CPU time spent in each syscall):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 15.252232 8646 1764 read
0.00 0.000000 0 3 write
0.00 0.000000 0 1 open
...
------ ----------- ----------- --------- --------- ----------------
100.00 15.252232 1778 total
Here almost all of the CPU time is spent in read system calls. Further investigation showed that these read calls (two of them are shown below) are continuous attempts to read the /proc/net/tcp file. This file lists the TCP sockets on the host, one entry per connection. On average it contains 70000-150000 entries.
8048 0.000068 open("/proc/net/tcp", O_RDONLY) = 7 <0.000066>
8048 0.000117 fstat(7, {st_dev=makedev(0, 3), st_ino=4026531993, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2013/04/01-09:33:57, st_mtime=2013/04/01-09:33:57, st_ctime=2013/04/01-09:33:57}) = 0 <0.000012>
8048 0.000093 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30a0d38000 <0.000033>
8048 0.000087 read(7, " sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode "..., 1024) = 1024 <0.000091>
8048 0.000170 read(7, " \n 6: 0300810A:0050 9275CE75:E67D 03 00000000:00000000 01:00000047 0000000"..., 1024) = 1024 <0.000063>
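To see what the agent has to wade through on every check, the entries of /proc/net/tcp can be decoded by hand. Below is a minimal sketch in Python, not what the agent itself does. It assumes a little-endian host (addresses are stored as hex in host byte order) and uses the kernel's TCP state codes. The names decode_addr and count_states are made up for illustration, and the sample lines are trimmed to their first four columns; only the first entry comes from the trace above, the other two are invented.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(field):
    """Decode 'HEXIP:HEXPORT' into (dotted IPv4, port); assumes little-endian."""
    ip_hex, port_hex = field.split(":")
    octets = bytes.fromhex(ip_hex)[::-1]  # reverse to network byte order
    return ".".join(str(b) for b in octets), int(port_hex, 16)


def count_states(text):
    """Count sockets per TCP state in /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Illustrative input, trimmed to the first four columns.
SAMPLE = """\
  sl  local_address rem_address   st
   6: 0300810A:0050 9275CE75:E67D 03
   7: 0300810A:0050 0100007F:C001 06
   8: 0300810A:0050 0100007F:C002 06
"""

if __name__ == "__main__":
    print(decode_addr("0300810A:0050"))
    print(dict(count_states(SAMPLE)))
    # On a real Linux host: count_states(open("/proc/net/tcp").read())
```

The entry from the trace decodes to local port 80 in state SYN_RECV, which is consistent with the connections belonging to nginx. With 70000-150000 such entries, every pass over the file means reading and parsing several megabytes of text.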