Prometheus Alerting on NodeClockNotSynchronising for VMs

Question

I'm trying to determine why this alert (NodeClockNotSynchronising) is firing for a handful of VMs I've provisioned. (Not all of them, just a few, which is strange.)

According to the metrics that are exported, I'm seeing:

    # HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
    # TYPE node_timex_sync_status gauge
    node_timex_sync_status 0

I can ssh into one of the VMs and ntpd is running and the date command returns the correct time.

So, digging into the timex collector documentation and code, here's what is "failing":

    var syncStatus float64
    var divisor float64
    var timex = new(unix.Timex)

    status, err := unix.Adjtimex(timex)
    if err != nil {
        return fmt.Errorf("failed to retrieve adjtimex stats: %w", err)
    }

    if status == timeError {
        syncStatus = 0
    } else {
        syncStatus = 1
    }

Since syncStatus is 0, the alert fires. Digging into the return codes of the adjtimex() syscall:

    #define TIME_ERROR        5        /* clock not synchronized */

Why would the kernel return TIME_ERROR when ntpd is running and the clock is synchronized? Any help would be greatly appreciated.

score 1 · Accepted Answer · answered Nov 22 '20 at 19:43

Whichever ntpd you are running, the kernel time discipline is reporting an error.

See man ntp_adjtime for the API and related constants.

On Linux, this could come from either NTP or PPS. Let's assume NTP, and further assume the error status is STA_UNSYNC (unsynchronized). This flag is set at boot and cleared when a system call is made with an ADJ_OFFSET mode, in other words when an ntpd attempts to gradually adjust the clock. It would not make sense for that never to happen, since every clock is at least a little bit off.
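To check which status bits the kernel is actually reporting on an affected VM, a read-only adjtimex call can be decoded by hand. The following is a minimal Go sketch, not the node_exporter implementation: it assumes Linux, uses the standard library's syscall.Adjtimex instead of golang.org/x/sys/unix, and the helpers stateName and decodeStatus are hypothetical names introduced for illustration. The constant values are copied from <sys/timex.h>.

```go
// Linux only: reads the kernel clock state without modifying it.
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// Return values of adjtimex(2), as defined in <sys/timex.h>.
var stateNames = map[int]string{
	0: "TIME_OK",
	1: "TIME_INS",
	2: "TIME_DEL",
	3: "TIME_OOP",
	4: "TIME_WAIT",
	5: "TIME_ERROR",
}

// Status bits from <sys/timex.h>, in ascending bit order.
var statusFlags = []struct {
	bit  int32
	name string
}{
	{0x0001, "STA_PLL"},
	{0x0002, "STA_PPSFREQ"},
	{0x0004, "STA_PPSTIME"},
	{0x0008, "STA_FLL"},
	{0x0010, "STA_INS"},
	{0x0020, "STA_DEL"},
	{0x0040, "STA_UNSYNC"},
	{0x0080, "STA_FREQHOLD"},
	{0x0100, "STA_PPSSIGNAL"},
	{0x0200, "STA_PPSJITTER"},
	{0x0400, "STA_PPSWANDER"},
	{0x0800, "STA_PPSERROR"},
	{0x1000, "STA_CLOCKERR"},
	{0x2000, "STA_NANO"},
	{0x4000, "STA_MODE"},
	{0x8000, "STA_CLK"},
}

// stateName (hypothetical helper) maps the adjtimex return value to its name.
func stateName(state int) string {
	if name, ok := stateNames[state]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", state)
}

// decodeStatus (hypothetical helper) lists the names of all set status bits.
func decodeStatus(status int32) string {
	var set []string
	for _, f := range statusFlags {
		if status&f.bit != 0 {
			set = append(set, f.name)
		}
	}
	if len(set) == 0 {
		return "(none)"
	}
	return strings.Join(set, "|")
}

func main() {
	var tx syscall.Timex // Modes is left at zero, so the call only reads state
	state, err := syscall.Adjtimex(&tx)
	if err != nil {
		fmt.Fprintln(os.Stderr, "adjtimex failed:", err)
		return
	}
	fmt.Printf("state:  %s\n", stateName(state))
	fmt.Printf("status: 0x%04x %s\n", tx.Status, decodeStatus(tx.Status))
	// Offset is in microseconds, or nanoseconds when STA_NANO is set.
	fmt.Printf("offset: %d\n", tx.Offset)
}
```

If the state prints as TIME_ERROR and STA_UNSYNC appears in the status list, the kernel still considers the clock unsynchronized, which matches the assumption above.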

Review your /etc/ntp.conf. Ensure it contains 4 or more sources via server or pool directives. Delete any undisciplined local clocks, whose entries begin with server 127.127.1. LOCL is obsolete, most server clocks are not amazing, and the 0 offset is possibly preventing the unsync flag from being cleared.
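As a rough illustration of that advice, an /etc/ntp.conf fragment might look like the sketch below. The public pool.ntp.org hostnames are an assumption chosen for the example; substitute whatever time sources are appropriate for your network.

```
# Four sources from the public NTP pool (assumed here; use your own if you have them)
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst
pool 2.pool.ntp.org iburst
pool 3.pool.ntp.org iburst

# Undisciplined local clock entries like these should be removed:
# server 127.127.1.0
# fudge  127.127.1.0 stratum 10
```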

Restart ntpd and wait two minutes. Watch the offsets relative to the NTP servers with ntpq -p (or chronyc sources -v if you run chrony). They should be single- or double-digit ms, but not zero.

Double check the date. Try it without confusing time zones: date --utc

Yep, the `ntp.conf` seems to be different on the ones triggering this alert. Thank you for your help, I appreciate it. I got this fixed by using the correct `server` source and restarting `ntpd`. — Gerb, Nov 30 '20 at 16:49
Good. This is a useful check if it caught non-functional NTP configurations. — John Mahowald, Nov 30 '20 at 17:12
