After hours of running, NTP stops working

Question

My NTP servers work great for a couple hours, then they stop working and show "reach: 0" for all hosts, like so:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 64-250-105-227. .PPS.            1 u   9h 1024    0   66.644    5.476   0.000
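
For readers unfamiliar with the column: ntpq prints "reach" as an 8-bit shift register in octal, one bit per poll, with the most recent poll in the lowest bit. A value of 0 therefore means that none of the last eight polls produced a reply ntpd accepted. A minimal sketch of reading that field follows; the helper names `decode_reach` and `parse_peer_line` are hypothetical, and the parser assumes the ten-column `ntpq -p` layout shown above.

```python
def decode_reach(reach_octal):
    """Return the last eight poll results as booleans, oldest first.

    ntpq shows the reachability register in octal; bit 0 is the
    most recent poll, bit 7 the oldest one still remembered.
    """
    value = int(reach_octal, 8) & 0xFF
    return [bool((value >> bit) & 1) for bit in range(7, -1, -1)]


def parse_peer_line(line):
    """Split one `ntpq -p` peer row into named fields.

    Assumes the ten-column layout from the output above; a tally
    character (such as '*' or '+') stays attached to the remote name.
    """
    names = ["remote", "refid", "st", "t", "when",
             "poll", "reach", "delay", "offset", "jitter"]
    fields = line.split()
    if len(fields) != len(names):
        raise ValueError("unexpected ntpq -p row: %r" % line)
    return dict(zip(names, fields))


row = " 64-250-105-227. .PPS.            1 u   9h 1024    0   66.644    5.476   0.000"
peer = parse_peer_line(row)
history = decode_reach(peer["reach"])
# True only when every one of the last eight polls went unanswered.
unreachable = not any(history)
```

A healthy peer shows 377 (all eight bits set), while 376 would mean only the most recent poll was missed.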

If I restart ntpd, they work fine again for about another 8 hours, but eventually end up like this again. tcpdump reveals that they are still sending and receiving packets just fine (the routing is a little weird because our ISP blocks NTP traffic, but we have another way out using a little policy-based routing and a guest running OpenVPN):

12:05:43.513183 IP (tos 0xc0, ttl 64, id 57760, offset 0, flags [DF], proto UDP (17), length 76)
    pvelocalhost.ntp > 64-250-105-227.ethoplex.com.ntp: [bad udp cksum 0x40e6 -> 0x6cec!] NTPv4, length 48
    Client, Leap indicator:  (0), Stratum 2 (secondary reference), poll 10 (1024s), precision -23
    Root Delay: 0.066635, Root dispersion: 0.601440, Reference-ID: 64-250-105-227.ethoplex.com
      Reference Timestamp:  3696656842.987997412 (2017/02/21 03:07:22)
      Originator Timestamp: 3696656843.552259385 (2017/02/21 03:07:23)
      Receive Timestamp:    3696656843.580105364 (2017/02/21 03:07:23)
      Transmit Timestamp:   3696689143.513155341 (2017/02/21 12:05:43)
        Originator - Receive Timestamp:  +0.027845976
        Originator - Transmit Timestamp: +32299.960896015
12:05:43.513708 IP (tos 0xc0, ttl 63, id 57760, offset 0, flags [DF], proto UDP (17), length 76)
    gateway.example.com.ntp > 64-250-105-227.ethoplex.com.ntp: [udp sum ok] NTPv4, length 48
    Client, Leap indicator:  (0), Stratum 2 (secondary reference), poll 10 (1024s), precision -23
    Root Delay: 0.066635, Root dispersion: 0.601440, Reference-ID: 64-250-105-227.ethoplex.com
      Reference Timestamp:  3696656842.987997412 (2017/02/21 03:07:22)
      Originator Timestamp: 3696656843.552259385 (2017/02/21 03:07:23)
      Receive Timestamp:    3696656843.580105364 (2017/02/21 03:07:23)
      Transmit Timestamp:   3696689143.513155341 (2017/02/21 12:05:43)
        Originator - Receive Timestamp:  +0.027845976
        Originator - Transmit Timestamp: +32299.960896015
12:05:43.573035 IP (tos 0x8, ttl 52, id 38657, offset 0, flags [DF], proto UDP (17), length 76)
    64-250-105-227.ethoplex.com.ntp > gateway.example.com.ntp: [udp sum ok] NTPv4, length 48
    Server, Leap indicator:  (0), Stratum 1 (primary reference), poll 10 (1024s), precision -18
    Root Delay: 0.000000, Root dispersion: 0.001205, Reference-ID: PPS^@
      Reference Timestamp:  3696689128.863678634 (2017/02/21 12:05:28)
      Originator Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
      Receive Timestamp:    3696689143.547838270 (2017/02/21 12:05:43)
      Transmit Timestamp:   3696689143.548149943 (2017/02/21 12:05:43)
        Originator - Receive Timestamp:  +0.034682918
        Originator - Transmit Timestamp: +0.034994553
12:05:43.573264 IP (tos 0x8, ttl 51, id 38657, offset 0, flags [DF], proto UDP (17), length 76)
    64-250-105-227.ethoplex.com.ntp > pvelocalhost.ntp: [udp sum ok] NTPv4, length 48
    Server, Leap indicator:  (0), Stratum 1 (primary reference), poll 10 (1024s), precision -18
    Root Delay: 0.000000, Root dispersion: 0.001205, Reference-ID: PPS^@
      Reference Timestamp:  3696689128.863678634 (2017/02/21 12:05:28)
      Originator Timestamp: 3696689143.513155341 (2017/02/21 12:05:43)
      Receive Timestamp:    3696689143.547838270 (2017/02/21 12:05:43)
      Transmit Timestamp:   3696689143.548149943 (2017/02/21 12:05:43)
        Originator - Receive Timestamp:  +0.034682918
        Originator - Transmit Timestamp: +0.034994553

Long story short, you can see the packets leaving, headed towards 64-250-105-227.ethoplex.com.ntp, and you can see we get a response back the same way. The first UDP checksum is bad, probably because of the TOE, but it all seems to work itself out after gateway masquerades as the source IP and recomputes the checksum on the packets.
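
As a sanity check on the reply itself, the four timestamps can be run through the standard NTP on-wire formulas (RFC 5905). This is only a rough sketch: it assumes the tcpdump capture time of the last packet (12:05:43.573264) can stand in for the destination timestamp that ntpd would record, which seems reasonable because the capture time of the outgoing packet matches its Transmit Timestamp to within a few tens of microseconds.

```python
from decimal import Decimal


def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client transmit (Originator Timestamp in the reply)
    t2: server receive  (Receive Timestamp in the reply)
    t3: server transmit (Transmit Timestamp in the reply)
    t4: client receive  (destination timestamp)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Values copied from the server reply in the capture above.
t1 = Decimal("3696689143.513155341")
t2 = Decimal("3696689143.547838270")
t3 = Decimal("3696689143.548149943")
# Assumption: capture time 12:05:43.573264 mapped onto the same
# NTP-era second (3696689143) as the other three timestamps.
t4 = Decimal("3696689143.573264")

offset, delay = ntp_offset_delay(t1, t2, t3, t4)
print("offset: %s s, delay: %s s" % (offset, delay))
```

Under that assumption the reply works out to an offset of about 4.8 ms and a round-trip delay of about 59.8 ms, the same order of magnitude as the last values ntpq recorded (5.476 ms and 66.644 ms). Decimal is used rather than float because a double cannot hold nanosecond precision at this magnitude.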

What is going on? And what options do I have besides setting up a cron job to restart NTP every couple of hours?

I have seen this happen once or twice before, but never had a chance to investigate thoroughly. It only happened to one peer in my case, so my first piece of advice is: add more peers, and preferably use the pool directive to configure them so that they are automatically replaced if unavailable. Next, check syslog for any evidence of problems; my guess is there will be none. — Paul Gear, Feb 24 '17 at 11:20
My suspicion about the cause of this in my case was that there was a faulty network link between the two servers in question, possibly causing my client to repeat packets and run afoul of the server's rate limiting. If you control the server, run "ntpq -nc mrulist" to see if your client has been rate limited or KoDed. — Paul Gear, Feb 24 '17 at 11:27
@PaulGear I'm a little too early for 'mrulist'; I'm on "Ver. 4.2.6p5". I'm using two selected stratum 1 servers, as well as two peers on the same LAN (using their own selected stratum 1 servers). All four show this behavior, and always show the same last reached time, so it happens simultaneously. I could try using the pool directive, but then I might have as many as all three local servers using the same remotes. Any other ways you know of to check rate limiting? Although, tcpdump isn't exactly showing spammy frequencies of NTP packets. — Isabell Cowan, Feb 24 '17 at 15:49
I'm rather inclined after this level of frustration to blame this build or version of NTP, and try my own with the latest sources. — Isabell Cowan, Feb 24 '17 at 15:55
That is an older version, and it would be great to try one of the more recent versions. I saw the problem on 4.2.6p3, and haven't seen it on 4.2.8 or later. — Paul Gear, Feb 26 '17 at 06:23
