We are doing API calls to Docusign, which fail occasionally with "getaddrinfo: Name or service not known" errors. Investigating further, we see that when we connect, name resolution fails sometimes, but only from our West datacenter location. Seems the GLB DNS for the US West can take a very long time to resolve, causing DNS client timeouts when it takes >10s to look up the address.
$ dig @1.1.1.1 www.docusign.net
; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> @1.1.1.1 www.docusign.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49468
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;www.docusign.net. IN A
;; ANSWER SECTION:
www.docusign.net. 22 IN CNAME www-geo.docusign.net.akadns.net.
www-geo.docusign.net.akadns.net. 22 IN CNAME www-west.docusign.net.akadns.net.
www-west.docusign.net.akadns.net. 22 IN A 162.248.184.27
;; Query time: 1 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Oct 04 13:16:43 EDT 2018
;; MSG SIZE rcvd: 126
Above is a good result, which took 1msec (cached)
$ dig @1.1.1.1 www.docusign.net
; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> @1.1.1.1 www.docusign.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21193
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;www.docusign.net. IN A
;; ANSWER SECTION:
www.docusign.net. 6 IN CNAME www-geo.docusign.net.akadns.net.
www-geo.docusign.net.akadns.net. 6 IN CNAME www-west.docusign.net.akadns.net.
www-west.docusign.net.akadns.net. 6 IN A 162.248.184.27
;; Query time: 2725 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Oct 04 13:21:29 EDT 2018
;; MSG SIZE rcvd: 126
This one is worse, as it took nearly 3s. During testing we've seen this go over 12s which will time out a lot of DNS clients and requesting apps.
Since the TTL is set to 30s, that means that every 30 seconds we have a chance at getting a timeout, our app generating errors, then a DNS success results in resumption of service. Unfortunately, this shows up as an error to our customers in our app.
We're able to work around this using hacks, but am curious if anyone else is seeing this, and how you've worked around it. Also, it might be good for people at docusign/akamai to look into why the performance of the www-west.docusign.net.akadns.net record is so bad.