
I am monitoring a host with the help of Zabbix and I noticed that Zabbix agent started using quite a lot of CPU cycles:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                         
26774 zabbix    20   0 68428 1312  752 R   99  0.0  63:27.67 /usr/sbin/zabbix_agentd                                                                                                                         
26773 zabbix    20   0 68428 1324  764 R   99  0.0  63:26.33 /usr/sbin/zabbix_agentd

There are about 100 items monitored with the agent. The same items are monitored on other, identical hosts, where the Zabbix agent does not consume nearly as much CPU. The agents send collected data to a Zabbix proxy. The agent configuration is the default one. The host CPU has 8 cores (2.4 GHz). The smallest update interval among the monitored items is 60 seconds.

I use Zabbix server / agent 1.8.11 and I can't upgrade to 2.2, at least not right now.

I checked the debug logs on all sides: Zabbix server, proxy, and agent, and can't find any issues there. Just the usual checks received and sent all the time.

I don't know how to investigate this issue further and am asking for the community's help. How can I trace why the agent is consuming so much CPU?

One more thing that looks strange to me is the network connection statistics:

netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
      2 CLOSE_WAIT
     21 CLOSING
   3521 ESTABLISHED
   2615 FIN_WAIT1
    671 FIN_WAIT2
   1542 LAST_ACK
     14 LISTEN
    256 SYN_RECV
 117841 TIME_WAIT

Thank you.

Update 1.

netstat -tnp|grep zabbix

    tcp        1      0 10.120.0.3:10050        10.128.0.15:53372        CLOSE_WAIT  23777/zabbix_agentd
    tcp        1      0 10.120.0.3:10050        10.128.0.15:53970        CLOSE_WAIT  23775/zabbix_agentd
    tcp        1      0 10.120.0.3:10050        10.128.0.15:53111        CLOSE_WAIT  23776/zabbix_agentd

10.128.0.15 is the IP of the Zabbix server; 10.120.0.3 is the IP of the monitored host.

Update 2.

Those TIME_WAIT connections belong to the nginx web server running on this host.
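
A quick way to confirm where they come from is to count which local ports hold the TIME_WAIT sockets, roughly like this (a sketch; ports 80/443 would point at nginx):

# count TIME_WAIT sockets per local port
netstat -ant | awk '$6 == "TIME_WAIT" {print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head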

Update 3.

I attached strace to one of the Zabbix agent processes, and it turned out that 100% of the traced time is spent in the read syscall:

strace -C -f -p 23776

Process 23776 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.175528        2515       865           read
------ ----------- ----------- --------- --------- ----------------
100.00    2.175528                   865           total

Update 4.

Just to make everything clear: I also tried to do something about the TIME_WAIT connection state. For example, I tried decreasing net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait and net.netfilter.nf_conntrack_tcp_timeout_time_wait to see if it helps. Unfortunately, it did not.
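
For reference, the change was along these lines (the exact values below are only illustrative, not the ones I settled on):

# shorten how long conntrack keeps TIME_WAIT entries (illustrative values)
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30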

Conclusion

The Zabbix agent CPU load issue turned out to be tied to the number of network connections. If we attach to the zabbix_agentd process with strace, we can see how the CPU cycles are spent (the first column is CPU time spent running in the kernel):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   15.252232        8646      1764           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0         1           open
...
------ ----------- ----------- --------- --------- ----------------
100.00   15.252232                  1778           total

Here almost all of the CPU time is spent in read system calls. Further investigation showed that these reads (two of them are shown below) are continuous attempts to read the /proc/net/tcp file, which lists the host's TCP sockets, one line per socket. On average the file contains 70,000-150,000 entries on this host.

8048       0.000068 open("/proc/net/tcp", O_RDONLY) = 7 <0.000066>
8048       0.000117 fstat(7, {st_dev=makedev(0, 3), st_ino=4026531993, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2013/04/01-09:33:57, st_mtime=2013/04/01-09:33:57, st_ctime=2013/04/01-09:33:57}) = 0 <0.000012>
8048       0.000093 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30a0d38000 <0.000033>
8048       0.000087 read(7, "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode    "..., 1024) = 1024 <0.000091>
8048       0.000170 read(7, "                         \n   6: 0300810A:0050 9275CE75:E67D 03 00000000:00000000 01:00000047 0000000"..., 1024) = 1024 <0.000063>
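
A quick way to see how much data each such poll has to plough through (a rough check; the numbers will of course vary):

# size of the table the agent re-reads on every poll
wc -l /proc/net/tcp
time cat /proc/net/tcp > /dev/null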
Andrew
  • That's a lot of time_wait in there. Are they all from the zabbix processes (do they show the IP of your monitoring server as the foreign address)? You could try using gdb and attaching to one of the zabbix processes to get a clue as to what it's trying to do. – Grant Mar 28 '13 at 13:30
  • can you please post the output of `sudo netstat -tnp|grep zabbix` ? – Mircea Vutcovici Mar 28 '13 at 13:33
  • CLOSE_WAIT means that the remote end has closed the connection, but the local application (zabbix_agentd in this case) is responding slowly or not at all. Check how long a connection stays in CLOSE_WAIT. If it is more than a few seconds, the application is very slow or frozen. – Mircea Vutcovici Mar 28 '13 at 14:26
  • Because you have a lot of TIME_WAIT connections, I think you have disabled HTTP keep-alive, or that the timeout is too short. Have you experimented with increasing this timeout? – Mircea Vutcovici Mar 28 '13 at 14:29
  • Thank you, I will monitor the CLOSE_WAIT connections. I experimented with Zabbix timeouts both on the server and on the agent side. It did not help to decrease the load, but I will check how it affects the CLOSE_WAIT connection state. Or maybe you meant the web server timeouts? – Andrew Mar 28 '13 at 14:32
  • CLOSE_WAIT means that the kernel knows the remote end has closed the connection and has notified the application that the remote end has closed it. The application should send any remaining data and close its end of the connection. If this is not happening, it could be that the application is slow, frozen, or has a bug in its network handling. – Mircea Vutcovici Mar 28 '13 at 14:35
  • TIME_WAIT means that a connection was completely closed, but the kernel is keeping the state so that new packets for this connection will be discarded. If the ESTABLISHED/TIME_WAIT ratio is small for a web server, this usually means that keep-alive is disabled or its timeout is very small. It is not necessarily a bad thing. You have to tune and measure the performance by changing the keep-alive timeout. – Mircea Vutcovici Mar 28 '13 at 14:37
  • @andrew you might also try turning off checks a few at a time to narrow down if it's a particular item that's causing the issues. – Grant Mar 28 '13 at 15:19
  • thanks, but I already tried that: disabling one set of checks, then another one, even all of them – Andrew Mar 28 '13 at 15:59

3 Answers


I think that the bottleneck is the disk. Here are my reasons:

You have a pretty busy web server. Zabbix is slow, and I suspect the slowness comes from disk reads (it could also be from the network).

Run strace again and find the file descriptor that Zabbix spends its time reading from.

Then check whether that file descriptor is a file or a socket:

ls -l /proc/<PID_of_straced_process>/fd/<FD_from_strace>
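
The symlink target tells you what it is. Hypothetical outputs (PID, fd and inode numbers made up) for a socket versus a regular file would look like:

ls -l /proc/23776/fd/7
# a socket:
lrwx------ 1 zabbix zabbix 64 Mar 28 14:00 /proc/23776/fd/7 -> socket:[123456]
# a (pseudo-)file, e.g. /proc/net/tcp:
lr-x------ 1 zabbix zabbix 64 Mar 28 14:00 /proc/23776/fd/7 -> /proc/net/tcp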

EDIT1:
You should not change the TIME_WAIT timeouts. The problem with a small HTTP keep-alive timeout, or with keep-alive disabled, is that you increase latency and bandwidth usage. Instead, you should increase the HTTP keep-alive timeout a little and install/enable SPDY.

EDIT2: Use dstat -ta 10 and compare the first line with the rest. The first line is the average since boot; the following lines are 10-second averages (set by the last parameter).

EDIT3: Check whether you are losing packets; use something like smokeping to monitor the server and the website from outside your network. You have a significant number of connections in CLOSING, FIN_WAIT1, FIN_WAIT2, SYN_RECV, and LAST_ACK. I think your network is congested, or you have a lot of short-lived connections (the high TIME_WAIT/ESTABLISHED ratio suggests the latter). See: http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation
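
For a quick packet-loss check from another machine, something like mtr in report mode works (the hostname is a placeholder):

# 100 probing cycles, summarized; look at the Loss% column per hop
mtr --report --report-cycles 100 your.server.example.com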

Mircea Vutcovici
  • Right now, the agent keeps reading /proc/diskstats all of the time. strace shows that the nanosleep syscall accounts for 72% of the time. I will test the disk, but at first glance vmstat shows no blocked processes. Also, I will try to play with the nginx keep-alive directive and timeouts tomorrow and post an update here. – Andrew Mar 28 '13 at 14:57
  • There is no packet loss, and dstat shows everything is fine with the disk, RAM, etc. I tried increasing the keep-alive for nginx connections, but that did not seem to help. – Andrew Mar 29 '13 at 13:51

zabbix_agentd reads /proc/net/tcp for every net.tcp.listen item. Here the file is roughly 100K lines * 150 bytes ≈ 15 MB, so if you have many net.tcp.listen items, these reads consume a lot of CPU: the data read per polling cycle is about 15 MB multiplied by the number of items.

It is recommended to use net.tcp.port instead of net.tcp.listen to avoid this performance issue.
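
For example, a check on port 80 could be rewritten roughly like this (item keys as in the Zabbix documentation; the port is just an example):

# old item key: parses the whole /proc/net/tcp table on every poll
net.tcp.listen[80]

# replacement: simply tries to open a TCP connection to the port (127.0.0.1 by default)
net.tcp.port[,80]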


Late answer (might be helpful for some people):

That happens often, depending on what you request with Zabbix, and usually it's a third-party problem or PEBKAC.

Disable checks (and restart the Zabbix server afterwards) to see which one is causing the heavy load, then analyze the problem accordingly.

E.g. I had several issues with Database Monitor items, where ODBC was causing the problem.

GioMac