My NTP servers work great for a couple of hours, then they stop syncing and show "reach: 0" for all hosts, like so:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 64-250-105-227. .PPS.            1 u   9h 1024    0   66.644    5.476   0.000
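For anyone reading the output above: as I understand it, the reach column is an 8-bit shift register printed in octal, with the least significant bit standing for the most recent poll. A minimal Python sketch of how to read it (the function name `decode_reach` is just my own illustration, not part of any NTP tool):

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Decode ntpq's octal 'reach' value into the last 8 poll results.

    The returned list is ordered most recent poll first; True means a
    usable reply was received for that poll.
    """
    value = int(reach_octal, 8) & 0xFF  # the register is 8 bits wide
    return [bool((value >> bit) & 1) for bit in range(8)]


if __name__ == "__main__":
    # "0" is what I see once things break: none of the last 8 polls counted.
    print(decode_reach("0"))
    # "377" is what a healthy peer shows: all of the last 8 polls counted.
    print(decode_reach("377"))
```

So a reach of 0 means ntpd has not counted a single reply from that server in the last eight polls.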
If I restart ntpd, they work fine again for about another 8 hours, but eventually end up like this again. tcpdump reveals that they are still sending and receiving packets just fine (the routing is a little weird because our ISP blocks NTP traffic, but we have another way out with a little policy-based routing and a guest running OpenVPN):
12:05:43.513183 IP (tos 0xc0, ttl 64, id 57760, offset 0, flags [DF], proto UDP (17), length 76)
pvelocalhost.ntp > 64-250-105-227.ethoplex.com.ntp: [bad udp cksum 0x40e6 -> 0x6cec!] NTPv4, length 48
Client, Leap indicator: (0), Stratum 2 (secondary reference), poll 10 (1024s), precision -23
Root Delay: 0.066635, Root dispersion: 0.601440, Reference-ID: 64-250-105-227.ethoplex.com
Reference Timestamp: 3696656842.987997412 (2017/02/21 03:07:22)
Originator Timestamp: 3696656843.552259385 (2017/02/21 03:07:23)
Receive Timestamp: 3696656843.580105364 (2017/02/21 03:07:23)
Transmit Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
Originator - Receive Timestamp: +0.027845976
Originator - Transmit Timestamp: +32299.960896015
12:05:43.513708 IP (tos 0xc0, ttl 63, id 57760, offset 0, flags [DF], proto UDP (17), length 76)
gateway.example.com.ntp > 64-250-105-227.ethoplex.com.ntp: [udp sum ok] NTPv4, length 48
Client, Leap indicator: (0), Stratum 2 (secondary reference), poll 10 (1024s), precision -23
Root Delay: 0.066635, Root dispersion: 0.601440, Reference-ID: 64-250-105-227.ethoplex.com
Reference Timestamp: 3696656842.987997412 (2017/02/21 03:07:22)
Originator Timestamp: 3696656843.552259385 (2017/02/21 03:07:23)
Receive Timestamp: 3696656843.580105364 (2017/02/21 03:07:23)
Transmit Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
Originator - Receive Timestamp: +0.027845976
Originator - Transmit Timestamp: +32299.960896015
12:05:43.573035 IP (tos 0x8, ttl 52, id 38657, offset 0, flags [DF], proto UDP (17), length 76)
64-250-105-227.ethoplex.com.ntp > gateway.example.com.ntp: [udp sum ok] NTPv4, length 48
Server, Leap indicator: (0), Stratum 1 (primary reference), poll 10 (1024s), precision -18
Root Delay: 0.000000, Root dispersion: 0.001205, Reference-ID: PPS^@
Reference Timestamp: 3696689128.863678634 (2017/02/21 12:05:28)
Originator Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
Receive Timestamp: 3696689143.547838270 (2017/02/21 12:05:43)
Transmit Timestamp: 3696689143.548149943 (2017/02/21 12:05:43)
Originator - Receive Timestamp: +0.034682918
Originator - Transmit Timestamp: +0.034994553
12:05:43.573264 IP (tos 0x8, ttl 51, id 38657, offset 0, flags [DF], proto UDP (17), length 76)
64-250-105-227.ethoplex.com.ntp > pvelocalhost.ntp: [udp sum ok] NTPv4, length 48
Server, Leap indicator: (0), Stratum 1 (primary reference), poll 10 (1024s), precision -18
Root Delay: 0.000000, Root dispersion: 0.001205, Reference-ID: PPS^@
Reference Timestamp: 3696689128.863678634 (2017/02/21 12:05:28)
Originator Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
Receive Timestamp: 3696689143.547838270 (2017/02/21 12:05:43)
Transmit Timestamp: 3696689143.548149943 (2017/02/21 12:05:43)
Originator - Receive Timestamp: +0.034682918
Originator - Transmit Timestamp: +0.034994553
Long story short, you can see the packets leaving for 64-250-105-227.ethoplex.com.ntp, and you can see we get a response back the same way. The first UDP checksum is bad, probably because of checksum offloading (the TOE), but it all seems to work itself out after gateway masquerades as the source IP and recomputes the checksum on the packets.
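As a sanity check on the timestamps in my outgoing request, here is a rough Python sketch. The two timestamp values are copied from the first packet in the capture; the helper name `ntp_to_utc` is my own. NTP timestamps count seconds from 1900-01-01, so converting to Unix time means subtracting 2208988800.

```python
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_OFFSET = 2_208_988_800

# Values copied from the first client packet in the capture.
originate = 3696656843.552259385
transmit = 3696689143.513155341


def ntp_to_utc(ntp_seconds: float) -> datetime:
    """Convert an NTP-era-0 timestamp (seconds since 1900) to a UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_TO_UNIX_OFFSET, tz=timezone.utc)


gap_seconds = transmit - originate
gap_hours = gap_seconds / 3600

if __name__ == "__main__":
    print("originate:", ntp_to_utc(originate).isoformat())
    print("transmit: ", ntp_to_utc(transmit).isoformat())
    print(f"gap: {gap_seconds:.3f} s ({gap_hours:.2f} h)")
```

The gap works out to roughly 9 hours, which lines up with the "9h" in the when column. As far as I can tell, that means the originate timestamp my ntpd keeps sending is left over from the last reply it actually processed, even though fresh replies are clearly arriving on the wire.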
What is going on? And what options do I have besides setting up a cron job to restart NTP every couple of hours?