I've got about 70 Linux instances running on an OpenStack cluster that currently consists of two compute nodes and one controller. These machines live in a Rackspace DC as part of their 'Private Cloud' program, so all of our resources are dedicated.

Previously we were using only Rackspace's NTP servers to synchronize the clocks on all of our instances, but Check_MK was frequently notifying us that the instances were syncing to themselves [stratum 10], implying that the NTP servers were not responding. Given that only 4 of the 70+ instances had public IP addresses, I assumed that Rackspace's NTP servers were rate limiting us, since they would be seeing 35+ times the normal rate of NTP queries originating from our two compute hosts. This seemed logical since the 4 instances with public IPs never generated any complaints about NTP.

To address this I changed ntp.conf on our instances to include our controller node alongside the Rackspace servers so we would at least have a fallback when the RS servers stopped responding. [the NTP cookbook we are using does not allow us to set a preference] However, this has not stopped, or even reduced, the number of NTP complaints. I've been seeing last-response values (the when column) in ntpq -p in excess of 60 minutes for all three hosts. I can't see how IP-based rate limiting might be coming into effect with the controller node, since the instances and the controller reside on, and communicate through, a private network where every instance has its own IP address.

What could be causing this? As far as I've been able to tell there is nothing in the restrict default line that would cause what we're experiencing.
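
One way to check whether a given server answers an instance at all, independently of ntpd, is to send it a single client-mode NTP packet and look at the reply. The following is only a minimal sketch: the function names (`build_request`, `parse_reply`, `probe`) are made up for illustration, it assumes Python 3 is available on the instance, and it reads just the stratum, reference ID and transmit timestamp from the standard 48-byte NTP header. A reply with stratum 0 is a kiss-o'-death packet, and a reference ID of `RATE` in such a packet would point at rate limiting.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request(version=3):
    """Build a 48-byte client request: LI=0, given version, mode=3 (client)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_reply(data):
    """Extract stratum, kiss code (if any) and transmit time from a reply."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    stratum = data[1]
    refid = data[12:16]
    tx_seconds, tx_fraction = struct.unpack("!II", data[40:48])
    unix_time = tx_seconds - NTP_EPOCH_OFFSET + tx_fraction / 2**32
    # Stratum 0 marks a kiss-o'-death packet; its refid is an ASCII code.
    kiss = refid.decode("ascii", "replace") if stratum == 0 else None
    return {"stratum": stratum, "kiss": kiss, "unix_time": unix_time}


def probe(host, timeout=2.0):
    """Send one request to host:123; return parsed reply, or None if no answer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, 123))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and name-resolution failures
            return None
    return parse_reply(data)
```

Calling `probe("controller01.dfw.domain.com")` from an instance during one of the bad spells would return `None` if nothing comes back within the timeout, which distinguishes "no reply at all" from "a reply that ntpd then rejects".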

ntp.conf from an instance:

driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable


server controller01.dfw.domain.com iburst
restrict controller01.dfw.domain.com nomodify notrap noquery
server time.dfw1.rackspace.com iburst
restrict time.dfw1.rackspace.com nomodify notrap noquery
server time2.dfw1.rackspace.com iburst
restrict time2.dfw1.rackspace.com nomodify notrap noquery

restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify


server  127.127.1.0 # local clock
fudge   127.127.1.0 stratum 10

ntp.conf from the controller node:

driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable


server 0.pool.ntp.org iburst
restrict 0.pool.ntp.org nomodify notrap noquery
server 1.pool.ntp.org iburst
restrict 1.pool.ntp.org nomodify notrap noquery
server 2.pool.ntp.org iburst
restrict 2.pool.ntp.org nomodify notrap noquery
server 3.pool.ntp.org iburst
restrict 3.pool.ntp.org nomodify notrap noquery

restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify


server  127.127.1.0 # local clock
fudge   127.127.1.0 stratum 10
  • Controller node OS is Ubuntu 12.04.3 LTS running ntpd 4.2.6p3
  • Instance OSes are CentOS 6.4/6.5 running ntpd 4.2.4p8/4.2.6p5

Edit:

Controller:

# ntpq -npcrv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+66.79.167.34    129.6.15.28      2 u  933 1024  377   50.360    3.898   5.064
-208.53.158.34   164.244.221.197  2 u  372 1024  377   27.384    6.635   5.323
+173.230.158.30  199.102.46.73    2 u  780 1024  357   47.656    0.897   0.596
*129.250.35.251  209.51.161.238   2 u  373 1024  377   40.828    1.786   0.163
 127.127.1.0     .LOCL.          10 l  84d   64    0    0.000    0.000   0.000
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p3@1.2290-o Tue Jun  5 20:12:08 UTC 2012 (1)",
processor="x86_64", system="Linux/3.2.0-54-generic", leap=00, stratum=3,
precision=-22, rootdelay=48.228, rootdisp=69.214, refid=129.250.35.251,
reftime=d6f049cf.5ce03f06  Wed, Apr  9 2014 22:35:59.362,
clock=d6f04f81.183edd61  Wed, Apr  9 2014 23:00:17.094, peer=21729,
tc=10, mintc=3, offset=1.514, frequency=12.879, sys_jitter=1.158,
clk_jitter=0.896, clk_wander=0.058

Instance:

$ ntpq -npcrv
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.240.0.81     129.250.35.251   3 u 1997 1024  376    0.461   -2.098   0.194
+72.3.128.240    204.9.54.119     2 u 1556 1024  376    0.677    2.234   4.023
*72.3.128.241    204.9.54.119     2 u 1664 1024  376    0.793   -0.783   0.836
 127.127.1.0     .LOCL.          10 l  51h   64    0    0.000    0.000   0.000
associd=0 status=06ff leap_none, sync_ntp, 15 events, stale_leapsecond_values,
version="ntpd 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)",
processor="x86_64", system="Linux/2.6.32-431.5.1.el6.x86_64", leap=00,
stratum=3, precision=-22, rootdelay=30.593, rootdisp=105.114,
refid=72.3.128.241,
reftime=d6f04951.9026bd89  Wed, Apr  9 2014 22:33:53.563,
clock=d6f04fd1.0d15b2be  Wed, Apr  9 2014 23:01:37.051, peer=54008,
tc=10, mintc=3, offset=-0.295, frequency=-0.163, sys_jitter=1.914,
clk_jitter=0.918, clk_wander=0.080, tai=35, leapsec=201207010000,
expire=201306280000
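
The when, poll and reach columns above can be read mechanically. reach is an 8-bit shift register printed in octal, with the lowest bit standing for the most recent poll, so 377 means the last eight polls were all answered and 376 means the most recent one was not. The sketch below is a hypothetical helper (the function names are not part of any NTP tool) that parses one peer line from `ntpq -np` and flags peers whose last reply is older than one poll interval. It assumes the column layout shown above, and since a when value can briefly exceed poll even on a healthy peer, the `overdue` flag is a hint rather than proof.

```python
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def when_to_seconds(field):
    """Convert the ntpq 'when' column to seconds; '-' means never heard from."""
    if field == "-":
        return None
    if field[-1] in UNIT_SECONDS:
        return int(field[:-1]) * UNIT_SECONDS[field[-1]]
    return int(field)


def reach_history(field):
    """Decode the octal reach register into 8 booleans, newest poll last."""
    value = int(field, 8)
    return [bool((value >> bit) & 1) for bit in range(7, -1, -1)]


def parse_peer(line):
    """Parse one peer line of `ntpq -np` output into a small summary dict."""
    tally, fields = line[0], line[1:].split()
    remote, _refid, _st, _t, when, poll, reach = fields[:7]
    history = reach_history(reach)
    when_s = when_to_seconds(when)
    return {
        "remote": remote,
        "tally": tally.strip(),
        "when_s": when_s,
        "poll_s": int(poll),
        "answered": sum(history),            # replies in the last 8 polls
        "last_poll_answered": history[-1],   # lowest bit of reach
        "overdue": when_s is not None and when_s > int(poll),
    }
```

Fed the first peer line of the instance output (10.240.0.81, when 1997, poll 1024, reach 376), this reports seven of the last eight polls answered, the most recent one unanswered, and the peer overdue.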
Sammitch
  • Since you are having problems with your controllers, can you fire up tcpdump and capture ntp on the controllers and clients? Are you seeing all the requests from the clients on the controllers? – Zoredache Apr 09 '14 at 21:07
  • Can you append `ntpq -pcrv` from a host and a controller node? Have you verified that there are no other FW rules in play? FYI all of your restrict server lines are redundant. They do not really do anything that is not included in the default restrict lines, and they make reading your config painful. Peering is different from requesting time. If you really want different restrictions for your servers you can use `restrict source` once and it will cover all associations. – dfc Apr 09 '14 at 22:30
  • @Zoredache fired it up on the controller and an instance, and it looks like what I thought was one private network is actually two. One virtual and one actual. The controller is seeing requests only from the compute nodes and not the individual instances. All the requests I generated seemed to be answered as expected, but the troubles come in spurts. I'll fire up tcpdump again once it flares up again. – Sammitch Apr 09 '14 at 22:54
  • @dfc yep, most of that config is redundant trash, but that's what the Opscode NTP cookbook generates so I'm stuck with it. I'm adding the requested output now... – Sammitch Apr 09 '14 at 23:00
  • Everything looks good from what you posted. Maybe check_mk is braindead? Can you post the ntpq -crv when the problem arises again? PS your leapseconds file is stale. – dfc Apr 09 '14 at 23:28
  • If the servers fall back to the stratum 10 local clock, that is hardly an issue with the monitoring; it just noticed and alerted. :> – Florian Heigl Jun 04 '14 at 15:02
