
I'm running a couple of servers which need a pretty tight time sync (<50ms) as they are running a Paxos algorithm. The servers run NTP and do sync successfully at some point. According to hwclock, the 11-minute mechanism is enabled, so the system time should be copied to the hardware clock.

However, I see that after a reboot the system time can be off by as much as 300ms compared to the time just before a reboot. Is it unreasonable to think that after a reboot the time should be within 50ms of the time just before reboot?

hbogert
  • Have you considered replacing the Paxos algorithm with an asynchronous algorithm? – kasperd Feb 12 '18 at 00:29
  • Outdated. In 2018 keeping servers in sync within 1ms is trivial, standard and in some industries legally mandated (banking, trading) for auditing purposes. Being hundreds of ms off means either really crappy hardware (by today's standards) or that all answers are wrong and there is a serious issue in the configuration. – TomTom Jun 17 '18 at 17:55
  • @TomTom Or there is a third option, that is you are wrong, and it is not trivial after a reboot to have `ms` accurate time. – hbogert Jun 18 '18 at 10:17
  • @hbogert Ah, let's see. Windows does it out of the box. You may want to check https://docs.microsoft.com/en-us/windows-server/networking/windows-time-service/accurate-time for a decent explanation, not only of the situation there, but they also go into some detail naming the legal requirements. You can easily solve that by virtualizing your Linux into Hyper-V: problem solved, time stable to less than 1ms. Now, for a reboot just wait a minute before starting processing after a reboot. Done. – TomTom Jun 18 '18 at 10:25
  • @TomTom You realize you just gave the exact solution of the answers below: waiting on NTP sync before starting other processes. The original question was clearly about how it's possible that after a reboot, time can be off by multiple hundreds of milliseconds. I'm not sure how adding Microsoft Windows addresses the fundamental problem given in the question. – hbogert Jun 19 '18 at 10:53

2 Answers


I do not have numbers to produce, but it seems probable that the interface used to set the clock at boot only has precision down to the second.

You do not state your OS, but on all Unix-like systems it is possible to insert a dependency on NTP time in the boot process.

The NTP daemon is started at boot, but often it immediately backgrounds itself and boot continues while the NTP daemon looks for servers to sync to -- this is so that boot is not delayed in case the machine is not connected to the network.

In this case, you will want to make sure that the NTP daemon is started in a way that will correct an offset by stepping at boot. This can be, for example, ntpd -gx or chronyd -q. You may also wish to insert a check that the offset is acceptable before starting your workload.
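As a rough illustration of such a pre-start check, here is a minimal Python sketch. It is not part of ntpd or chrony: it sends a single SNTP request and applies the standard NTP offset formula. The function names, the 50 ms default limit (taken from the question) and the idea of querying one server directly are all assumptions for illustration; a single sample is much noisier than what ntpd itself computes.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (two 32-bit words) to Unix seconds."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(t1, t2, t3, t4):
    """Standard NTP offset estimate.

    t1/t4 are the client's send/receive times, t2/t3 the server's
    receive/transmit times. A positive result means the local clock
    is behind the server.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def query_offset(server, timeout=2.0):
    """Send one SNTP request and return the estimated offset in seconds."""
    request = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        t4 = time.time()
    words = struct.unpack("!12I", reply[:48])
    t2 = ntp_to_unix(words[8], words[9])    # server receive timestamp
    t3 = ntp_to_unix(words[10], words[11])  # server transmit timestamp
    return clock_offset(t1, t2, t3, t4)


def offset_acceptable(offset, limit=0.050):
    """True if the absolute offset is within the limit (assumed 50 ms)."""
    return abs(offset) <= limit
```

A startup wrapper could loop on `offset_acceptable(query_offset(server))` with a short sleep, where `server` is whichever NTP source you trust, and only launch the workload once it returns true.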

Law29
  • can you name a technical cause for the hundreds of milliseconds drift during a reboot? – hbogert Feb 11 '18 at 22:04
  • Not really. Even the pre-HPET clocks have a precision of some 0.03 ms (2^15 Hz). Maybe the time is not actually written to the hardware clock (you can check that by reading it and comparing to the system clock), or maybe there is some problem when reading it during the boot process. – Law29 Feb 11 '18 at 22:22
  • Just to clarify, I think that sub-50ms is a bit too demanding, but 300ms is quite a lot. It could actually be that the time is being set with a one-second resolution somewhere. – Law29 Feb 12 '18 at 08:02
  • There should be no need to do the equivalent of ntpdate -b before starting ntpd on most distros; -g is included by default, which allows the first step to be large. – Paul Gear Feb 12 '18 at 08:58
  • @PaulGear yes, added (I'd call it kind of an equivalent of `ntpdate -b`, but `ntpd -g` is the preferred way to do it) – Law29 Feb 12 '18 at 10:51
  • @Law29 The realtime clock only has a resolution of seconds. But still, the kernel's 11-minute mechanism should synchronize the time on whole seconds. Reading should be timed to be on a whole second as well. I'm beginning to doubt the systemd implementation during boot. I'll try to test with different hardware and non-systemd distros. – hbogert Feb 12 '18 at 15:52
  • After reading the kernel source I can't find any indication that reading the time on boot is synchronized on a clock edge. That would mean that, statistically, the offset at boot is drawn from a uniform distribution. I started searching at https://elixir.free-electrons.com/linux/latest/source/drivers/rtc/hctosys.c – hbogert Feb 13 '18 at 00:44
  • Well, I think you found it; if the RTC clock can only be read with a one-second precision then being some 0.3 seconds off would be likely (unless it's 0.3 seconds every single time you boot, of course), and if there is really no way to get better resolution out of the API (which is confirmed by https://linux.die.net/man/4/rtc), you'd have to spin reading the clock until it changes! There might be something new with HPET, https://blog.fpmurphy.com/2009/07/linux-hpet-support.html gives an example of clock_gettime with timespec, which should provide nanoseconds. Do you have HPET? – Law29 Feb 13 '18 at 21:10
  • Or not, https://www.kernel.org/doc/Documentation/timers/hpet.txt says that it is just a timer. – Law29 Feb 13 '18 at 21:12
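To make the reasoning in the comments above concrete: assuming the hardware clock is read as a whole number of seconds at a random point within the second, the resulting error is uniform on [0, 1) s, so only about 5% of boots would land within 50 ms and an error of 300 ms is unremarkable. The "spin reading the clock until it changes" idea can be sketched in Python as below. This is an illustration only, not what the kernel does; `wait_for_tick`, `read_seconds` and `pause` are hypothetical names, and `read_seconds` stands in for whatever returns the whole-second clock value.

```python
def wait_for_tick(read_seconds, pause=lambda: None, max_polls=10_000_000):
    """Poll a whole-second clock until its value changes.

    Returns the new value. At the moment of return, the true time is
    close to exactly that second boundary, so the caller can set a
    finer-grained clock to `value` with a small error instead of one
    of up to a second.
    """
    start = read_seconds()
    for _ in range(max_polls):
        now = read_seconds()
        if now != start:
            return now
        pause()  # optional back-off between polls; default is a busy spin
    raise TimeoutError("clock did not tick within max_polls reads")
```

For experimenting without hardware, `wait_for_tick(lambda: int(time.time()))` (after `import time`) blocks for up to a second and returns right after the system clock crosses a second boundary.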

My initial reaction was that 300ms seems like an awful lot, but I do have numbers to produce, and they show that @Law29 is right:

  1. One of my machines over a normal week:
    • Frequency: [graph: frequency]
    • System peer offset: [graph: sysoffset]
  2. Same system, shorter period with a reboot involved:
    • Frequency: [graph: frequency-reboot]
    • System peer offset: [graph: sysoffset-reboot]
    • Scatter plot of the peers: [graph: peerstatsplot-reboot]

(Hope you can read all the numbers on the graphs OK - drop me a comment if not.)

As you can see, there's a rather large discrepancy. It surprised me how much it was, and also how long it took to get back on track with the frequency correction, considering that there's a stratum 1 GPS source on my local network. And given that the peer samples are fairly tightly clustered on the plot, it's clearly a problem with the local clock, not inconsistent network delay during startup. (For the record, the hardware is a Shuttle DS437 fanless mini-PC with a dual-core Celeron 1037U @ 1.8 GHz.)

So the takeaways are probably:

  1. make sure ntpd is successfully writing the NTP drift file,
  2. make sure the kernel's 11-minute timer to update the hardware clock is on (see "Automatic Hardware Clock Synchronization by the Kernel" in man hwclock for details), or that your shutdown process updates the hardware clock,
  3. make sure ntpd has 4-10 reachable sources (in iburst mode), and
  4. configure your startup dependencies so that ntpd has a chance to fix the clock before Paxos starts.
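For the first takeaway, a quick way to notice a drift file that is not being written is to look at its size and modification time. The following Python sketch does that; the function name and the two-hour default are assumptions for illustration, and the path is whatever your `driftfile` setting points to. Treat a "stale" result as a prompt to investigate (permissions are a common culprit), not as proof of a fault.

```python
import os
import time


def drift_file_status(path, max_age=2 * 3600, now=None):
    """Return (ok, reason) for an NTP drift file.

    ok is True only if the file exists, is non-empty, and was modified
    within max_age seconds (assumed default: two hours).
    """
    now = time.time() if now is None else now
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False, "missing"
    if st.st_size == 0:
        return False, "empty"
    if now - st.st_mtime > max_age:
        return False, "stale"
    return True, "fresh"
```

Calling `drift_file_status(path)` from a monitoring check or a pre-start script gives an early warning before the next reboot exposes the problem.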
Paul Gear
  • Not sure how I should interpret your second graph. Further, the drift file of ntpd is not for correcting the hardware clock, right? The takeaway of saving to hwclock on shutdown seems to be outdated as well, since a lot of distros nowadays use systemd, whose services rely on the `11-minute` kernel mechanism. I am going to monitor it just like you, I guess. – hbogert Feb 12 '18 at 10:14
  • The second graph is just the usual system offset over a week, the point of which is merely to show that it can keep reasonably close time (+/- 5ms max/min, average around 1ms) normally, whereas I got large positive & negative offsets after the reboot. – Paul Gear Feb 12 '18 at 11:26
  • The drift file is where ntpd saves the local clock frequency error on an hourly basis. It uses it on startup as a seed to reduce the amount of work it has to do to calculate it. But if it's not writing it (it's surprising how many times simple permissions problems break things), then it will necessarily be less accurate after reboot. – Paul Gear Feb 12 '18 at 11:28
  • The 11-minute timer was a new one on me. Looks like ntpd will turn it on normally as well. So the main thing is to confirm that it's actually on and working. – Paul Gear Feb 12 '18 at 11:33
  • And have you checked that your ntpd is actually stepping not slewing at boot? Are you using `-g` only or `-gx`? I know `-x` shouldn't change anything and that `-g` should step if offset is over 128ms, but can't hurt to be sure. Maybe the initial ntpd estimate is off and the initial stepping actually hurts! Chrony might handle things better. – Law29 Feb 12 '18 at 11:45
  • @PaulGear Are you sure? I thought only `/etc/adjtime` was used for hardware clock drift correction at boot, not NTP's drift file. – hbogert Feb 12 '18 at 17:28
  • @hbogert Sure about what? I don't really understand what you're asking. NTP maintains the drift file (usually `/var/lib/ntp/ntp.drift`) to track the frequency error of the system clock (the interrupt-driven software one). It saves this hourly, and reads it when `ntpd` starts, to initialise the system clock frequency error. According to `man hwclock`, the kernel will automatically update the hardware clock from the system clock if 11-minute mode is on (which it normally is if `ntpd` is running), and it seems that `/etc/adjtime` is not used at all in that case. – Paul Gear Feb 12 '18 at 22:10