AWS AmazonProvidedDNS appears not to respect TTL - can we do anything?

Question

We have several servers on AWS EC2 that are not obeying TTL values from DNS. Route tables are set up to use "AmazonProvidedDNS". It appears that "AmazonProvidedDNS" limits TTL to 60 seconds.

Q: Is this caused by AWS DNS server adjusting the TTL in transit, and is there anything we can do about it?

Notes:
- We have deployed dnsmasq for now with a min-cache-ttl of 300; this is not ideal, as we'd prefer to obey the TTL rules.
- Running Centos7, official AMI, but I don't think that's relevant.

Evidence to back up the question.

These tests were run against a domain we have in Route 53, where the CNAME has a TTL of 300 seconds. (The outputs below have had the domain searched and replaced with example; the tests were run against a real domain we control.)

The five outputs below prove it's AWS DNS:

1) Running the official Centos7 AMI, with no modifications.

This shows incorrect TTL of 60 seconds:

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9532
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     60      IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 52.0.228.53
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 18.232.11.127

;; Query time: 391 msec
;; SERVER: 10.131.0.2#53(10.131.0.2)
;; WHEN: Wed Jul 25 01:04:00 UTC 2018
;; MSG SIZE  rcvd: 140

2) Running the same AMI, with dnsmasq set up but pointing at AWS DNS as parent.

This shows incorrect TTL of 60 seconds:

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57290
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     60      IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 52.0.228.53
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 18.232.11.127

;; Query time: 276 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Jul 25 01:03:07 UTC 2018
;; MSG SIZE  rcvd: 140

3) Running the same AMI, with dnsmasq set up but pointing at AWS DNS as parent, with min-cache-ttl.

First request shows incorrect TTL of 60 seconds (as this will have come from AWS), second request shows "min-cache-ttl" of 300 seconds:

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26595
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     60      IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 52.0.228.53
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 18.232.11.127

;; Query time: 280 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Jul 25 01:25:31 UTC 2018
;; MSG SIZE  rcvd: 140

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50913
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     289      IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 289 IN A 18.232.11.127
example-645584916.us-east-1.elb.amazonaws.com. 289 IN A 52.0.228.53

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Jul 25 01:29:02 UTC 2018
;; MSG SIZE  rcvd: 143

4) Running the same AMI, with dnsmasq set up (but pointing at Google DNS as parent).

This shows correct TTL of 300 seconds:

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36048
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     299     IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 59 IN A 18.232.11.127
example-645584916.us-east-1.elb.amazonaws.com. 59 IN A 52.0.228.53

;; Query time: 295 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Jul 25 01:07:15 UTC 2018
;; MSG SIZE  rcvd: 140

5) Running a local Centos7 pointing at our own DNS.

This shows correct TTL of 300 seconds:

dig www.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7307
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 13, ADDITIONAL: 27

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.example.com.            IN      A

;; ANSWER SECTION:
www.example.com.     300     IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 52.0.228.53
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 18.232.11.127

;; Query time: 343 msec
;; SERVER: 10.72.73.31#53(10.72.73.31)
;; WHEN: Wed Jul 25 10:41:02 AEST 2018
;; MSG SIZE  rcvd: 936
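
For anyone who wants to repeat the comparison across resolvers without reading the TTLs by eye, here is a minimal sketch that pulls the TTL of each record out of the ANSWER section of saved dig output. The function name answer_ttls and the regular expression are my own illustration, not part of dig or any library, and the sample text is the answer section from test 5 above.

```python
import re

# One resource record line from dig's ANSWER section:
#   name  ttl  class  type  rdata
RECORD = re.compile(r"^(\S+)\s+(\d+)\s+IN\s+(\S+)\s+(.+)$")


def answer_ttls(dig_output):
    """Return a list of (name, type, ttl) for records in the ANSWER section."""
    records = []
    in_answer = False
    for line in dig_output.splitlines():
        line = line.strip()
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or a comment line ends the section.
            if not line or line.startswith(";"):
                break
            match = RECORD.match(line)
            if match:
                name, ttl, rtype = match.group(1), match.group(2), match.group(3)
                records.append((name, rtype, int(ttl)))
    return records


# Answer section copied from test 5 (local Centos7, our own DNS).
sample = """\
;; ANSWER SECTION:
www.example.com.     300     IN      CNAME   example-645584916.us-east-1.elb.amazonaws.com.
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 52.0.228.53
example-645584916.us-east-1.elb.amazonaws.com. 60 IN A 18.232.11.127

;; Query time: 343 msec
"""

for name, rtype, ttl in answer_ttls(sample):
    print(f"{rtype:5} {ttl:4} {name}")
```

Feeding it the output of each of the five tests makes the difference easy to spot: the CNAME comes back with 60 whenever the answer passed through AmazonProvidedDNS, and with 300 (or just under) otherwise.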

AWS services that support failover, etc. have a TTL of 60 seconds. A CNAME has a TTL and points to another record with a TTL. Which TTL should the client resolve? I could not find a spec with that answer. Note: you should not use CNAME records with AWS Load Balancers. Use A-Alias Records. — John Hanley, Jul 25 '18 at 02:21
@JohnHanley: Regarding the CNAME, the two DNS records have independent TTLs: you can see this in the final example above, where the CNAME will refresh after 300 seconds and the A lookup after 60. Each is refreshed when its own TTL expires, so any given request may refresh one, none or both. But this highlights the problem with using min-cache-ttl in dnsmasq, as it affects both: you can see this in example 3. We want to obey the TTLs right down the line, 60 seconds for AWS (or 15 in some cases!) and 300 or more for ours. – Robbie, Jul 25 '18 at 03:29
Thanks for note about A-Alias: we are using that correctly these days. That snapshot was one of our first from a while ago, and I should change it over but that site's 100% live and I'm concerned about getting it wrong! — Robbie, Jul 25 '18 at 03:30
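
To make the min-cache-ttl concern from the comments concrete, here is a rough model of a cache that raises short TTLs to a floor. It assumes the floor behaves as a simple max(upstream TTL, floor), which matches how test 3 looks but is my simplification rather than dnsmasq's actual code; cached_ttl is a hypothetical helper name.

```python
def cached_ttl(upstream_ttl, min_cache_ttl=0):
    """TTL a cache would store if short TTLs are raised to a floor.

    Assumption: the floor is applied as a plain maximum per record.
    """
    return max(upstream_ttl, min_cache_ttl)


# TTLs as published: our CNAME (300 s) and the ELB A record (60 s).
published = [("CNAME", 300), ("A", 60)]

for floor in (0, 300):
    stored = [(rtype, cached_ttl(ttl, floor)) for rtype, ttl in published]
    print(f"floor={floor}: {stored}")
```

With no floor the two records keep their own TTLs. With a floor of 300 the ELB A record is also held for 300 seconds, which is exactly the side effect we are trying to avoid.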
