Immediately fix system clock after VMware stun

Question

It's a known issue with VMware that it performs what is called a "stun" during certain operations, such as vMotion and snapshot create/delete. During this stun the guest OS is frozen, so when it comes back, the system clock is behind. The stuns are usually quick in human terms (sub-second), but in machine terms they're long: several hundred milliseconds. The times get worse on VMs with bigger disks or more memory, which are often the more critical VMs. With systems that communicate with each other, these time differences can cause problems.

But in any case, the issue I am trying to address is the clock. The ultimate requirement is to immediately get the system clock back in sync after a stun happens. "Immediate" might be a vague term, so let's say the clock should be back in sync within 1 second.

We do use ntp for clock synchronization, but ntp takes several minutes (or longer) to get the system back in sync as it doesn't understand what just happened. It takes a while to verify the time is stable again (that drift rate hasn't spiked), and then slowly correct things. So it's not fast enough.
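As a rough illustration of why slewing is slow: ntpd caps its slew rate at 500 ppm (0.5 ms of correction per second of real time), so a lower bound on the time needed to slew away an offset can be estimated as below. The function name is made up for this sketch, and the estimate ignores the time ntpd spends polling and deciding that the offset is real, so actual recovery takes longer.

```python
# Rough lower bound on how long ntpd needs to slew away a clock offset.
# Assumption: ntpd's maximum slew rate of 500 ppm (500 microseconds per second).
MAX_SLEW_PPM = 500.0


def slew_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Return the minimum wall-clock seconds needed to slew out the offset."""
    return abs(offset_seconds) / (slew_ppm * 1e-6)


if __name__ == "__main__":
    for offset_ms in (100, 300, 500):
        secs = slew_seconds(offset_ms / 1000.0)
        print("%d ms offset: at least %.0f s (%.1f min)" % (offset_ms, secs, secs / 60))
```

Even a 300 ms stun therefore costs at least ten minutes of slewing, which is consistent with the "several minutes (or longer)" observed above.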

The best idea I've got is to immediately run ntpdate when a stun happens, but I do not know of any way for the guest OS to discover that a stun has happened.

The systems in question are Linux (CentOS/7).

*but I do not know of any way for the guest OS to discover that a stun has happened.* - the host alerts the guest through VMware tools so guest applications (e.g. SQL, Exchange) can make sure their data stores are closed cleanly. You can hook into this mechanism yourself with [pre-freeze and post-thaw scripts](https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006671) also [available on Linux](https://communities.vmware.com/thread/224772) — TessellatingHeckler, Mar 28 '17 at 18:04
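A minimal sketch of such a post-thaw hook, assuming VMware Tools executes /usr/sbin/post-thaw-script after a quiesced snapshot (the mechanism described in the linked KB article) and that ntpdate is installed in the guest. The server name is a placeholder, and the helper names are hypothetical. As far as I know these hooks only fire for quiesced snapshots, so this would not cover a stun caused by vMotion.

```python
#!/usr/bin/python
# Hypothetical post-thaw hook: step the clock once after a quiesced snapshot.
# Assumptions: installed as /usr/sbin/post-thaw-script, run as root, ntpdate present.
import subprocess
import sys

NTP_SERVER = "ntp.example.com"  # placeholder; use the same source as the hosts


def build_ntpdate_command(server):
    """Build a one-shot ntpdate invocation for the given server."""
    # -b: step the clock with settimeofday() instead of slewing it
    # -u: use an unprivileged source port, so it works while ntpd holds port 123
    return ["ntpdate", "-b", "-u", server]


def main():
    try:
        return subprocess.call(build_ntpdate_command(NTP_SERVER))
    except OSError as exc:
        # ntpdate missing or not executable
        sys.stderr.write("post-thaw time sync failed: %s\n" % exc)
        return 1


if __name__ == "__main__":
    sys.exit(main())
```

Stepping the clock can move time backwards from the point of view of running applications, so this is only appropriate where the workload tolerates that.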

score 0 · Answer 1 · answered Mar 28 '17 at 12:55

Do the VM hosts also use the same NTP sources as the guests?

"These [timesync disable] options do not disable one-time synchronizations done by VMware Tools for events such as tools startup, taking a snapshot, resuming from a snapshot, resuming from suspend, or vMotion. These events synchronize time in the guest operating system with time in the host operating system, therefore it is important to make sure that the host operating system's time is correct." (Source: Timekeeping best practices for Linux guests)

It is possible to set a smaller NTP step threshold, but databases especially don't deal well with time going backwards.

Hrm, that message would imply that as long as the tools are running, it's impossible to disable the one-time synchronization on those specific events. The tools are running, but are not doing the one-time syncs. — phemmer, Mar 31 '17 at 19:42

score 0 · Accepted Answer · edited Jun 11 '20 at 10:02

The official VMware article and solution for this issue can be found here: https://kb.vmware.com/s/article/2108828

If adjustments to NTP prove to be insufficient in mitigating effects of time differences due to virtual machine migration, configure VMware tools one-time time synchronization to have a lower threshold value.

The one-time synchronization uses the vmx option pref.timeLagInMilliseconds, which defaults to 1000 (1 second).

For example, if you want the guest clock to be synchronized with the host, whenever time falls behind more than 100 milliseconds after migration, add this to your vmx file.

pref.timeLagInMilliseconds = 100

Documentation on editing the vmx file can be found here: https://kb.vmware.com/s/article/1714

So for my situation I set the value to 10, so that if the time is off by more than 10 milliseconds after a stun then it gets synced by VMware. Then I let NTP handle the more granular adjustment from there.
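If the setting has to be applied to many VMs, the edit can be scripted. The following is only a sketch: set_vmx_option is a hypothetical helper that works on the text of a vmx file, and it writes the value in the quoted `key = "value"` style that vmx files conventionally use, which is an assumption on my part (the example above shows it unquoted). See the KB article on editing the vmx file for when and how the file can safely be changed.

```python
# Sketch: set or replace one option in the text of a .vmx file.
# Assumption: one "key = value" entry per line; values written in quoted form.
import re

KEY = "pref.timeLagInMilliseconds"


def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key set to value, replacing any existing entry."""
    new_line = '{} = "{}"'.format(key, value)
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        if pattern.match(line):
            if not replaced:
                out.append(new_line)  # replace the first occurrence in place
                replaced = True
            # any further duplicates of the key are dropped
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)  # key was absent, so append it
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = 'memsize = "4096"\npref.timeLagInMilliseconds = 1000\n'
    print(set_vmx_option(sample, KEY, 10))
```

The function only transforms a string; reading and writing the actual file (and backing it up first) is left to the caller.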
