Zabbix Graphs Data is Intermittent

Question

We use Proxmox VE as our virtualization environment and recently upgraded from 2.X to 3.X. We also moved from a single host to a two-host cluster and, last but not least, migrated our VMs from an LVM backend to a GlusterFS one.

Here is a graph from just before the migration. Notice the clean, clear lines:

Before

Now, here is the same graph from right now:

After

My first thought was that virt1 wasn't responding quickly, so I used zabbix_get to test that theory, and here's the result:

[root@monit ~]# for i in {1..10}; do (time zabbix_get -s virt1 -k system.cpu.load[,avg1]) 2>&1 | grep -i real | awk '{print $2}'; sleep 1; done
0m0.011s
0m0.015s
0m0.010s
0m0.010s
0m0.010s
0m0.010s
0m0.010s
0m0.011s
0m0.011s
0m0.011s

The response is very quick, and certainly nowhere near the three-second limit.
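The same check can be wrapped in a small helper that flags any command slower than a given limit. This is only a sketch: the function name `slow_check` is hypothetical, it assumes GNU `date` (for `%N`), and the three-second limit is simply the figure mentioned above.

```shell
# slow_check LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND, discards its output, and prints "ok <seconds>" or
# "slow <seconds>" depending on whether it took longer than the limit.
# Assumes GNU date, which supports nanoseconds via %N.
slow_check() {
    local limit=$1; shift
    local start end elapsed
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    # Do the floating-point arithmetic in awk; the shell only handles integers.
    elapsed=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f", e - s }')
    if awk -v t="$elapsed" -v l="$limit" 'BEGIN { exit !(t + 0 > l + 0) }'; then
        echo "slow $elapsed"
    else
        echo "ok $elapsed"
    fi
}
```

For example, `slow_check 3 zabbix_get -s virt1 -k 'system.cpu.load[,avg1]'` would repeat the test from the loop above for a single request, and any other item key can be substituted.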

Also, this doesn't happen on all hosts: it happens on virt1, virt2, and a VM called nas, but not on any of the other VMs.

Hopefully, there's a Zabbix guru here who can help.

Thanks!

ETA:

Here are the stats that asaveljevs was talking about:

Timestamp               Value
2014.Aug.14 09:13:56    17
2014.Aug.14 09:13:27    18
2014.Aug.14 09:12:56    17

Are there any network errors regarding these hosts in the server log? Also, how busy are your pollers? Zabbix internal process load can be monitored using `zabbix[process,poller,avg,busy]` and other [similar items](https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/internal). — asaveljevs, Aug 12 '14 at 06:41
@asaveljevs I'm running Zabbix 1.8.20, and I don't think those keys exist: `[root@monit ~]# zabbix_get -s localhost -k zabbix[process,poller,avg,busy] | wc -l: 0` — Soviero, Aug 12 '14 at 14:26
These keys do exist since Zabbix 1.8.5 (see https://www.zabbix.com/documentation/1.8/manual/config/items#internal_checks), but they are not agent checks. Rather, they are of type "Simple check" and are processed by the server itself. — asaveljevs, Aug 13 '14 at 06:15
Items of type "Zabbix internal" (the reference to "Simple check" above is a typo) are processed by the server itself and they cannot be run manually, i.e. they cannot be queried using `zabbix_get`. Instead, in Zabbix frontend they should be created on a host monitored by Zabbix server: create an item like you usually do, but select "Type" to be "Zabbix internal" and use "zabbix[process,poller,avg,busy]" as the "Key". — asaveljevs, Aug 14 '14 at 10:54
@asaveljevs Updated question with your requested information. — Soviero, Aug 14 '14 at 14:15
In the server log, are there any network errors regarding this host? An example error could be the following: "Zabbix agent item "mysql.ping" on host "sql" failed: first network error, wait for 15 seconds". — asaveljevs, Aug 15 '14 at 07:34
@asaveljevs I can't believe I forgot Zabbix had a server log! What do you think would cause these errors? http://pastebin.com/3k9jKCTw — Soviero, Aug 15 '14 at 23:34
@asaveljevs Just to clarify, I'm talking about the network errors for virt2. Not the active check errors, those aren't supposed to work for firewall reasons. — Soviero, Aug 16 '14 at 08:28
Items "swap.used" and "memory.used", which are complained about in the server log, are not Zabbix built-in items. This suggests that you are probably using user parameters on the agent. Could you please post how they are calculated and how long they take to process (for instance, by performing the `zabbix_get` loop in the question, but replacing "system.cpu.load[,avg1]" with these keys)? What is the "Timeout" parameter set to on the server, and what is it set to on the agent? — asaveljevs, Aug 18 '14 at 08:06
@asaveljevs Ok, so I fixed the errors on virt2 by changing the user parameters to use the right awk (/usr/bin/awk vs /bin/awk). However, one host (nas) is still a problem: http://pastebin.com/Ew89gJ7T Any ideas? — Soviero, Aug 20 '14 at 14:30
What is the value of StartAgents on the "nas" host? Theoretically, if it is low (e.g., 1), it might be running out of listeners. For instance, around "20140819:142858" we can see it being polled by 5 pollers simultaneously. However, the "nas" host only has a couple of network problems every couple of days, on average, so it might be a legitimate network problem. — asaveljevs, Aug 21 '14 at 06:54
@asaveljevs Interesting... I hadn't noticed that the errors were so far apart. The problem is that the graph for nas is broken similarly to virt2: http://i.imgur.com/vDYNQHy.png Those networking errors don't explain that though. — Soviero, Aug 21 '14 at 17:36
Could you please post the Zabbix log for the period that is shown on the graph? In general, for further troubleshooting help, I would suggest the Zabbix IRC channel and forum (see https://www.zabbix.org/wiki/Getting_help). — asaveljevs, Aug 22 '14 at 06:50
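To see at a glance which hosts the server log complains about, the network error lines can be tallied per host. The following is a rough sketch rather than a definitive tool: the function name `network_errors_by_host` is hypothetical, and it assumes log lines contain the text `on host "<name>"` together with `network error`, as in the example message quoted in the comments above.

```shell
# network_errors_by_host LOGFILE
# Prints "<count> <host>" for every host named in a "network error" line,
# with the most frequently affected host first.
network_errors_by_host() {
    grep 'network error' "$1" \
        | sed -n 's/.*on host "\([^"]*\)".*/\1/p' \
        | awk '{ count[$0]++ } END { for (h in count) print count[h], h }' \
        | sort -rn
}
```

It would be called with the path to the server log, for instance `network_errors_by_host /var/log/zabbix/zabbix_server.log`; that path is an assumption, and the actual location is set by the `LogFile` parameter in the server configuration.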
