
I've got a remote unmanned server that's been exhibiting some extremely strange clock/NTP behavior lately. Symptoms:

  • Very high jitter
  • ntpq -pn returns:
    • A 'when' count that keeps resetting to 1, even though the NTP server is literally 1 m of CAT5e away and directly connected to the machine in question. There are no signs of packet loss or any other communication breakdown.
    • Frequently, a refid of 'LOCAL(0)' even though I know the NTP server in question is having no issues reaching its stratum 2 server.
admin@machine:~$ date && ntpq -pn
Thu 24 May 19:34:02 UTC 2018
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 <local_ntpserver>  LOCAL(0)        15 u  120  128  377    0.120  -486.68 909.283

admin@machine:~$ date && ntpq -pn
Thu 24 May 19:38:37 UTC 2018
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 <local_ntpserver>  <remote_ntpserver>    3 u    1  128  377    0.123  -1854.0 2164.83

From the local NTP server (i.e. the machine running at the same physical location):

      remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 <remote_ntpserver>   <remote_ntpserver2>  2 u   49   64  377  5076.18  1546.21 299.468

You can see that this local NTP server has good reach and relatively low jitter, despite being across a high-latency wireless network.

I've lowered minpoll and maxpoll to 4 and 5 on the primary machine so that ntpd polls more frequently. This band-aid seems to keep the primary machine somewhat tethered to reality (before, it was drifting minutes away several times a day), but I'd like to get to the root of this weird behavior.
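For reference, minpoll and maxpoll are exponents of two, in seconds, rather than raw second counts. The sketch below only does that arithmetic; the `server` line in the comment is an assumed shape for the ntp.conf entry, not a copy of the actual config on this machine.

```python
# Assumed ntp.conf entry on the primary machine (illustrative only):
#   server <local_ntpserver> minpoll 4 maxpoll 5

def poll_interval_seconds(exponent: int) -> int:
    """Convert an ntpd poll exponent to the interval in seconds (2**exponent)."""
    return 2 ** exponent


if __name__ == "__main__":
    for label, exponent in (
        ("minpoll 4", 4),
        ("maxpoll 5", 5),
        ("default minpoll 6", 6),
        ("default maxpoll 10", 10),
    ):
        print(f"{label}: {poll_interval_seconds(exponent)} s")
```

So the lowered settings poll every 16 to 32 seconds, compared with 64 to 1024 seconds at the defaults.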

I have a theory that the TSC clock source could be drifting wildly, but I have no evidence of this. It would explain the high jitter, though, which in turn could be causing some of the weird NTP behavior.
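One way to start gathering evidence for the TSC theory is to check which clock source the kernel is actually using. This is a minimal sketch, assuming a Linux host that exposes the usual clocksource sysfs directory; the function name `read_clocksource` is made up for illustration.

```python
from pathlib import Path

# Standard Linux sysfs location for the active clock source.
SYSFS_CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = SYSFS_CLOCKSOURCE) -> dict:
    """Return the current and available kernel clock sources.

    Values are None when the sysfs files are missing (for example on a
    non-Linux host).
    """
    def read(name):
        path = base / name
        return path.read_text().strip() if path.is_file() else None

    available = read("available_clocksource")
    return {
        "current": read("current_clocksource"),
        "available": available.split() if available is not None else None,
    }


if __name__ == "__main__":
    info = read_clocksource()
    print("current clocksource:   ", info["current"])
    print("available clocksources:", info["available"])
```

If the current source is not `tsc`, or the kernel log shows the TSC being marked unstable, that would be worth including in the question.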

Regardless, I don't understand why the refid keeps reverting to 'LOCAL(0)' when the local NTP server clearly hasn't lost its upstream source. The NTP service is not restarting. For example:

● ntp.service - LSB: Start NTP daemon
   Loaded: loaded (/etc/init.d/ntp)
   Active: active (running) since Wed 2018-05-23 15:58:50 UTC; 1 day 3h ago

Yet I've observed numerous reversions to 'LOCAL(0)' in the last few hours, so it's not as if ntpd is starting from scratch and needs time to initialize or collect the right data.
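To pin down when these reversions happen, the peer lines from repeated `ntpq -pn` runs could be parsed and logged with a timestamp. The following is only a sketch: `Peer`, `parse_peer_line`, and `is_local_fallback` are hypothetical helper names, the field layout is taken from the billboard shown above, and the 192.0.2.10 and 198.51.100.7 addresses are placeholders for the redacted servers.

```python
from dataclasses import dataclass

# Characters ntpq may print in the first column of a peer line.
TALLY_CODES = " x.-+#*o"


@dataclass
class Peer:
    tally: str
    remote: str
    refid: str
    stratum: int
    kind: str         # the "t" column, e.g. "u" for unicast
    when: str         # kept as text: may be "-" or carry a unit suffix
    poll: str         # kept as text for the same reason
    reach: int        # ntpq prints this register in octal
    delay_ms: float
    offset_ms: float
    jitter_ms: float


def parse_peer_line(line: str) -> Peer:
    """Parse one data row of `ntpq -pn` output into a Peer record."""
    if not line.strip():
        raise ValueError("empty line")
    if line[0] in TALLY_CODES:
        tally, body = line[0], line[1:]
    else:
        tally, body = " ", line
    fields = body.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    remote, refid, st, kind, when, poll, reach, delay, offset, jitter = fields
    return Peer(tally, remote, refid, int(st), kind, when, poll,
                int(reach, 8), float(delay), float(offset), float(jitter))


def is_local_fallback(peer: Peer) -> bool:
    """True when the peer reports its own local clock as its reference."""
    return peer.refid.startswith("LOCAL(")


if __name__ == "__main__":
    samples = [
        " 192.0.2.10  LOCAL(0)        15 u  120  128  377    0.120  -486.68 909.283",
        " 192.0.2.10  198.51.100.7     3 u    1  128  377    0.123  -1854.0 2164.83",
    ]
    for sample in samples:
        peer = parse_peer_line(sample)
        print(peer.refid, peer.stratum, peer.offset_ms, is_local_fallback(peer))
```

Running something like this from cron once a minute and recording only the rows where `is_local_fallback` is true would show whether the flips line up with anything on the local NTP server.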

  • Is this a VM or a physical box running all these clocks? It looks a bit of a mess, but I'm not sure where the issue lies without you describing your entire time domain setup. What are all the servers, where do they get their time from, and what is the output of `ntpq -pcrv` on all of them? – user3788685 Jun 03 '18 at 12:04
  • In my opinion, `local_ntpserver` doesn't have relatively low jitter. All being well, jitter should be single digits or less. And its offset is poor. If `local_ntpserver` is the only time source for your problematic system, you need to fix it first. Start by using at least 4 remote sources. If you want further suggestions, supply `ntpq -npc rv` output as @user3788685 suggested. – Paul Gear Dec 03 '18 at 02:44
