
Backstory: I have a couple of internal stratum 1 NTP clocks with GPS receivers, and two public NTP servers, virtualized on top of VMware ESXi, that take time from the stratum 1 clocks and distribute it. Otherwise this setup works fine and provides good time when compared to other public servers.

Problem: When I reboot the virtual machines, they do not start syncing properly, and get stuck in an unsynchronised state. Below is the ntpq -p output after a reboot.

root@server:~$ ntpq -p
 remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.1.40    .GPS.            1 u   27   64    3    1.533  -258.43 5948.73
 192.168.2.40    .GPS.            1 u   24   64    3    1.118  -258.47 6138.19
 192.168.3.42    .GPS.            1 u   24   64    3    0.709  -258.42 5655.02
 194.100.49.151  194.100.49.134   2 u   22   64    3    8.124  -258.74 7131.65
 gbg1.ntp.se     .PPS.            1 u   26   64    3   21.856  -258.43 4876.90
 ntp2.sptime.se  .PPS.            1 u   23   64    3   19.991  -258.42 7764.97
 ntp1.sptime.se  .PPS.            1 u   27   64    3   20.489  -258.41 8574.46

If I then restart the ntp service, I get this:

root@server:~$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.1.40    .GPS.            1 u    2   64    1    1.517  -258.45   0.065
 192.168.2.40    .GPS.            1 u    1   64    1    1.126  -258.46   0.025
 192.168.3.42    .GPS.            1 u    2   64    1    0.719  -258.42   0.020
 194.100.49.151  194.100.49.134   2 u    5   64    1    8.041  -258.72   0.000
 gbg1.ntp.se     .PPS.            1 u    6   64    1   21.839  -258.41   0.000
 ntp2.sptime.se  .PPS.            1 u    4   64    1   19.968  -258.41   0.000
 ntp1.sptime.se  .PPS.            1 u    3   64    1   20.418  -258.43   0.000

A second later it steps:

root@server:~$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.1.40    .STEP.          16 u    2   64    0    0.000    0.000   0.000
 192.168.2.40    .STEP.          16 u    2   64    0    0.000    0.000   0.000
 192.168.3.42    .STEP.          16 u    8   64    0    0.000    0.000   0.000
 194.100.49.151  194.100.49.134   2 u    -   64    1    7.976   -0.261   0.000
 gbg1.ntp.se     .PPS.            1 u    -   64    1   21.840    0.060   0.000
 ntp2.sptime.se  .STEP.          16 u    6   64    0    0.000    0.000   0.000
 ntp1.sptime.se  .STEP.          16 u    6   64    0    0.000    0.000   0.000

And after that we're back to normal operation:

root@server:~$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.1.40    .GPS.            1 u    1   64    1    1.474    0.044   0.017
*192.168.2.40    .GPS.            1 u    1   64    1    1.102    0.030   0.005
 192.168.3.42    .GPS.            1 u    1   64    1    0.674    0.049   0.009
 194.100.49.151  194.100.49.134   2 u    8   64    1    7.976   -0.261   0.000
 gbg1.ntp.se     .PPS.            1 u    8   64    1   21.840    0.060   0.000
 ntp2.sptime.se  .PPS.            1 u    6   64    1   19.979    0.059   0.000
 ntp1.sptime.se  .PPS.            1 u    5   64    1   20.440    0.048   0.000

So it seems that after a reboot the system clock is off by quite a bit, which is to be expected, but it's a bit hard for me to understand why ntpd doesn't panic and instead just steps the clock.
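For reference, one way to confirm that ntpd still considers itself unsynchronised is to query its system variables (leap=11 and stratum=16 mean "no sync"); assuming ntpq is available locally:

root@server:~$ ntpq -c "rv 0 leap,stratum,offset"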

Here's my ntp.conf:

tinker panic 0
# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

driftfile /var/lib/ntp/ntp.drift


# Enable this if you want statistics to be logged.
statsdir /var/log/ntpstats/

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable


# You do need to talk to an NTP server or two (or three).
#server ntp.your-provider.example

# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 192.168.1.40  iburst
server 192.168.2.40 iburst
server 192.168.3.42 iburst
server time1.mikes.fi
server ntp1.gbg.netnod.se
server ntp2.sptime.se
server ntp1.sptime.se

# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html for
# details.  The web page <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.

# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.
#restrict 192.168.123.0 mask 255.255.255.0 notrust


# If you want to provide time to your local subnet, change the next line.
# (Again, the address is an example only.)
#broadcast 192.168.123.255

# If you want to listen to time broadcasts on your local subnet, de-comment the
# next lines.  Please do this only if you trust everybody on the network!
#disable auth
#broadcastclient
Stuggi

1 Answer


ntpd's default step threshold is 0.125 s, and the panic threshold after the first packet is 1000 s. In other words, an offset that jumps by 15+ minutes is outside the design conditions.
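Those thresholds can be tuned with tinker in ntp.conf if the defaults do not suit you; a sketch with example values only (note that the posted config already has tinker panic 0):

# Example values only; see ntp.conf(5) for the tinker command.
tinker panic 0      # never exit on a huge offset (already in the posted config)
tinker step 0.5     # raise the step threshold to 500 ms
tinker stepout 300  # how long the offset must exceed the threshold before ntpd steps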

Your captures show the initial packet exchange, the step, and eventually peer selection. It takes a minute or two to establish, due to how the NTP algorithms work, even if you use the iburst option. A reach of 3 indicates that only two packets have been received so far. Wait longer, as long as you are not dropping NTP packets.
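For reference, the reach column is an 8-bit shift register printed in octal, one bit per poll, so it fills up like this as consecutive polls succeed:

polls answered:  1   2   3    4    5    6     7     8
reach (octal):   1   3   7   17   37   77   177   377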

If the initial offsets or the stepping are not acceptable, wait until ntpd or the operating system reports the clock synchronized before starting anything time-sensitive. For systemd on Linux, try depending on systemd-time-wait-sync.service.
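As a sketch, a unit that must not start before the clock is in sync could carry a drop-in like the following (my-app.service is a placeholder; systemd-time-wait-sync.service usually has to be enabled first):

# /etc/systemd/system/my-app.service.d/wait-for-time-sync.conf
[Unit]
Wants=systemd-time-wait-sync.service
After=systemd-time-wait-sync.service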

John Mahowald
  • @Stuggi, but you should still definitely add the `iburst` option; it should still help you get into sync faster than without it. Also make sure your ESXi host has a good NTP configuration (and that guest-to-host clock syncing is off) so that the VM has the best possible start. – Paul Gear May 16 '19 at 21:47
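For completeness, applying that suggestion to the public servers in the posted ntp.conf would look something like this (my rendering of the comment, not from the original config):

server time1.mikes.fi      iburst
server ntp1.gbg.netnod.se  iburst
server ntp2.sptime.se      iburst
server ntp1.sptime.se      iburst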