Nagios - check_ntp_time - Offset Unknown

Question

I have a local NTP server running on the subnet to keep the other subnet nodes in sync, so that not every node has to sync with upstream servers. While implementing the check_ntp_time plugin for Nagios, however, I ran into a frustrating issue: Nagios keeps reporting a critical error for the local nodes that sync against the local NTP server.

Here is the NTP config on the local NTP server. Note the upstream server entries and the restrict entry; according to my research, this qualifies the node as an NTP server that local nodes can sync against.

driftfile /var/lib/ntp/drift

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Makes me able to answer requests from local nodes
restrict 10.0.0.0 mask 255.255.192.0 nomodify notrap

# My source
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

logfile /var/log/ntp/server.log

statistics loopstats
statsdir /var/log/ntp/
filegen peerstats file peers type day link enable
filegen loopstats file loops type day link enable

On the local non-NTP-server nodes, everything is the same except that the restrict entry is removed and the server entries reference only the local NTP server: `server ntp.example.com iburst`.

Every local node can resolve ntp.example.com.
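For reference, here is a sketch of what the client-side ntp.conf amounts to. It is reconstructed from the description above rather than copied from a node, so treat the exact layout as an assumption. The relevant point is that, with the subnet restrict entry removed, requests from any other host (including the Nagios server) fall under the `restrict default` line with its `kod limited` flags.

```
driftfile /var/lib/ntp/drift

# Same defaults as on the NTP server: rate limiting (limited) and
# kiss-of-death replies (kod) apply to every host not matched below
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery

# Loopback keeps full access
restrict 127.0.0.1
restrict -6 ::1

# No "restrict 10.0.0.0 mask 255.255.192.0 ..." entry on these nodes

# Only the local NTP server is used as a time source
server ntp.example.com iburst
```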

The problem I am having is when I run the following command from the nagios server:

/usr/lib64/nagios/plugins/check_ntp_time -H node-a-1 -v

And the output:

sending request to peer 0
response from peer 0: offset -0.002921819687
sending request to peer 0
response from peer 0: offset -0.0001939535141
sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
re-sending request to peer 0
discarding peer 0: stratum=0
overall average offset: 0
NTP CRITICAL: Offset unknown|

This happens for all the nodes except the local NTP server, which references upstream servers. At first I thought it was an iptables issue, but I have the port pinholed on every local NTP node (to allow Nagios access to check the time difference):

ACCEPT     udp  --  eth0   *       10.0.0.0/18          0.0.0.0/0           multiport dports 123 /* 777 allow ntp access */ state NEW

Versions:

nagios-plugins-ntp: 1.4.16
ntp: 4.2.6p5-1.el6.centos

Any help is greatly appreciated. I really can't submit the Nagios work until I get this resolved; as you know, keeping server times in sync is priority 1.

-- Edit --

Per the comments, here are the results of ntpq -p, on various nodes:

# Actual NTP Server (10.0.0.2)
==============================================================================
+propjet.latt.ne 241.199.164.101  2 u  105  128  337   14.578   12.954   7.138
+x2la01.hostigat 63.145.169.2     3 u   21  128  377   16.037   13.546   4.090
*pacific.latt.ne 241.199.164.101  2 u   72  128  377   15.148   24.434   7.403

# Local node 1
==============================================================================
*service-a-1.sn1 204.2.134.163    3 u    9  128  377    0.228    5.217   1.296

# Local node 2
==============================================================================
*service-a-1.sn1 204.2.134.163    3 u   91  128  377    0.200    3.608   1.167

Check the servers to make sure they are successfully using the local master NTP server as a preferred peer. Log into one and run `ntpq -p ::1`, and check that the master NTP server is marked with an asterisk. — aecolley, Aug 29 '14 at 16:51
Duplicate of [Nagios NTP, discarding peer](http://serverfault.com/q/269701/50647)? Same version of the plugin, too. — Aaron Copley, Aug 29 '14 at 16:56
@AaronCopley No, he claims to be running 1.4.15, I'm running 1.4.16. Also, every request returned a response, if you notice in my request, only the first two returned a response, not sure why it has all the extra 're-sendings'. — Mike Purcell, Aug 29 '14 at 17:09
Posted by a different user, with no confirmation by OP. I am trying to find the change-log for that particular plugin. — Mike Purcell, Aug 29 '14 at 17:18
Point was, it's a lead with indication that you aren't alone in this issue. I didn't downvote you or vote to close or anything. Just giving a lead. — Aaron Copley, Aug 29 '14 at 17:32
No worries. I did see that post yesterday during my research, but it led me to a dead end. — Mike Purcell, Aug 29 '14 at 20:46
@utrecht: No I have not. TBH it's been a long while since the post, I will try to check back into it asap. — Mike Purcell, Jan 26 '15 at 22:00
@MikePurcell Perhaps the ESX time is used. [This answer](http://unix.stackexchange.com/questions/150020/how-to-disable-time-sync-in-vm-guest/150142#150142) helped me to solve the issue. — 030, Jan 26 '15 at 22:46

score 8 · Accepted Answer · answered Sep 04 '14 at 15:38

The key line here is this one:

discarding peer 0: stratum=0

An NTP server identifying itself as stratum 0 is a violation of the spec (stratum 0 is reserved for reference clocks such as atomic clocks; in a response packet it means "unspecified or invalid"). I had this problem years ago with some BSD and Mac OS X hosts. I ended up hacking the stratum check out of the source and maintaining a separate build of the plugin for "problematic" hosts.

The offending lines are 254-257 (currently, anyway), if you want to rip that out. It's a hack, but it works for me ;-)

I found this thread in the mailing list archives about it. I think there was another one where I suggested adding a command-line option to ignore the stratum check, but I don't think it got any traction.

There's also a bug report about it, but it hasn't yielded anything useful as far as I can tell.

This answer is a blast from my past. Same workaround deployed many years ago. Sigh. — dmourati, Jan 12 '15 at 06:45

score 1 · Answer 2 · answered Jan 07 '19 at 18:12

I removed the problem by disabling the KOD (kiss-of-death) feature on the NTP server.

check_ntp sends (at least) 4 requests in quick succession to calculate a statistically sound average offset. The third and all following requests are considered a denial of service attack by the server and are answered with a KOD message (invalid stratum, namely 0). In fact, this behaviour should be considered a bug of check_ntp as KOD must be processed properly by the client.
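A possible alternative to turning KOD off for everyone is to exempt only the monitoring host from rate limiting. The fragment below is a minimal sketch for the ntp.conf on each monitored node. It assumes the Nagios server sits at 10.0.0.10, which is a hypothetical address that does not appear in the question. Because ntpd applies the most specific matching restrict entry, the host line takes precedence over the defaults for that one address.

```
# Defaults stay rate-limited for every other host
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery

# Hypothetical Nagios server address (assumption, adjust to your setup).
# No "limited" or "kod" flag here, so the quick burst of requests
# from check_ntp_time is not rate-limited or answered with KOD.
restrict 10.0.0.10 nomodify notrap nopeer noquery
```

ntpd needs a restart to pick up the change. Re-running the check with `-v` from the Nagios server should then show whether every request gets a normal response.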
