Name Server change & propagation works partially : Success for US and fail for India

Question

Our domain is registered with Enom. DNS records are managed through Enom's reseller for India, Pugmarks. We want to move DNS record management from Enom/the reseller to AWS Route53 while retaining Enom as the domain registrar.

TTL for domain's DNS records is at 300 (5 mins). I have checked TTL for name servers and found it to be 3600 (1 hr).

When we replaced the Enom name servers with the Route53 ones, Enom stopped resolving the domain instantly. ISP DNS servers followed suit once the TTL expired, and our website traffic dropped (as observed in Google Analytics). This impact is understood.

A while later, querying the NS record for the domain through public/open name servers such as 4.2.2.2 -- 4.2.2.6, 8.8.8.8 and 8.8.4.4 returned the updated records pointing to Route53, i.e.:

dig NS <domain.com> @8.8.4.4

The above command shows the Route53 name server records. All other records (A, CNAME etc.) show up as well, indicating that these DNS servers have picked up the name server change. At this point we observe US traffic returning in Google Analytics.
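The per-resolver check described above can be scripted. This is only a minimal sketch, assuming `dig` is installed; the function name `ns_via_resolvers` and the domain `example.com` are illustrative placeholders, not part of the original setup.

```shell
# Hypothetical helper: print the NS set each resolver returns for a domain.
# Usage: ns_via_resolvers <domain> <resolver-ip>...
ns_via_resolvers() (
    domain=$1
    shift
    for resolver in "$@"; do
        # +short prints only the record data; sort so the sets compare easily.
        answer=$(dig +short NS "$domain" @"$resolver" | sort | paste -sd' ' -)
        printf '%s\t%s\n' "$resolver" "${answer:-<no answer>}"
    done
)

# Example (needs network access):
# ns_via_resolvers example.com 8.8.8.8 8.8.4.4 4.2.2.2
```

Running the same call against the ISP resolvers makes it easy to see which ones still return the old name servers or nothing at all. Any error text that dig prints ends up in the answer column.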

But Indian traffic still remains at zero. I have queried a couple of DNS servers from two different Indian ISPs (not open to the public, restricted to ISP users). These do not return any records. We waited 4 hours for the ISPs to catch up with the change, but in vain.

It is weird that the US region was able to get the new records, while none of the Indian ISPs we tried (at least 5 of them) picked up the change. Every other DNS test tool on the web picked up the change, except the ISPs here. The result was a big dip in traffic, which is a major concern since India is the audience the site targets.

After 4 hours of wait-and-watch, we switched the entries back to the Enom name servers. Within seconds, the Indian ISPs were able to resolve records again, as if they had been querying the Enom servers all along, even though the TTL is 1 hr. (Route53 would continue to resolve, so US traffic remained unchanged.)

I have two suspicions:

1. The Indian ISPs are caching the NS records for the domain for more than 1 hr, probably for 48 hrs.
2. There is some issue specific to the Indian region that I have no clue about.

Point 1 is the prime suspect as far as I am concerned. Here is a link that gives details about the domain. It shows a 48 hr TTL for the NS records at the parent name servers, while the local name servers give them a 1 hr TTL. Could this be causing the issue?
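The two TTLs can also be read directly with `dig` rather than through a web tool. The following is a sketch under assumptions: `ns_ttl` is a hypothetical helper name, `example.com` stands in for the real domain, and `ns1.example-dns.net` is a placeholder for one of the zone's own name servers. `a.gtld-servers.net` is one of the servers for the com zone.

```shell
# Hypothetical helper: TTL of the first NS record in one section of a response.
# Usage: ns_ttl <domain> <server> <section>   (section: authority | answer)
ns_ttl() (
    domain=$1 server=$2 section=$3
    # +norecurse asks the server for its own data only;
    # +noall +<section> prints just that section of the response.
    dig +norecurse +noall "+$section" NS "$domain" @"$server" |
        awk '$4 == "NS" { print $2; exit }'
)

# Example (needs network access):
# ns_ttl example.com a.gtld-servers.net authority    # parent-side (delegation) TTL
# ns_ttl example.com ns1.example-dns.net answer      # the zone's own NS TTL
```

The parent returns the delegation as a referral in the authority section, while the zone's own server returns its NS set in the answer section, which is why the section is a parameter.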

I want to move DNS management over to Route53 and I cannot have a downtime for over 6 hrs. We have tried up to 4 hrs in vain.

Why is this happening and what is the way out?

One alternative, perhaps, is to raise the TTL of all the domain's DNS records to 49 hrs (greater than the TTL of the NS record at the parent), wait for that TTL change to propagate, and only then switch name servers. It is not foolproof, but it can be tried.

The TTL is not the problem here. The Indian ISPs are the problem. — Michael Hampton, Feb 02 '15 at 16:16
@MichaelHampton Could it have something to do with AWS Route53? 4 hours is a long time for at least one of the ISPs to pick up a name server change. None did! — anup, Feb 17 '15 at 02:46

score 2 · Answer 1 · answered Apr 01 '16 at 19:46

(This is an old question, but still deserves a reply)

Apparently what you did was this: You prepared the new name servers to authoritatively answer queries for your domain. Then you switched registration (i.e., changed the NS entries for dnsindia.com at the parent DNS servers responsible for com to point to the new DNS servers); at the same moment the old name servers stopped replying to queries about dnsindia.com (or replied with NXDOMAIN or something).

As a consequence, the impact - especially for your main audience - was the following: After 1hr, any data cached at DNS resolvers at Indian ISPs aged out - but only data for your entries, such as A records for www.india.com. Hence the resolvers would try to query the appropriate name servers for fresh data. However, the information about which servers to query had not aged out yet: That info came from the com zone and had a TTL of 48hrs (so probably still up to 47hrs, let's say 24hrs on average); as this refers to the now defunct DNS servers at the old provider, failures occur as you observed. On the other hand, querying a remote resolver would succeed as it would be unlikely to have a cached copy of the parent NS records.
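The arithmetic behind "up to 47hrs" can be written down as a small helper. This is a simplified model and an assumption on my part, following the reasoning above: a resolver fails from the moment its cached records expire until its cached NS set from the parent expires. The name `outage_seconds` is hypothetical.

```shell
# Simplified model (assumption): outage = time between the cached records
# expiring and the cached parent NS set expiring, never less than zero.
# Both arguments are remaining cache lifetimes in seconds.
# Usage: outage_seconds <ns_remaining> <record_remaining>
outage_seconds() (
    ns_remaining=$1 record_remaining=$2
    gap=$((ns_remaining - record_remaining))
    [ "$gap" -gt 0 ] || gap=0
    echo "$gap"
)

# NS set cached just before the switch (48h left), records with 1h left:
# outage_seconds 172800 3600    # prints 169200, i.e. 47 hours
```

A resolver whose cached NS set was already close to expiry at the time of the switch sees little or no outage, which is where the "24hrs on average" estimate comes from.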

How to do it properly? The following strategies are possible (in decreasing order of preference):

a) Ensure that the old DNS servers keep serving your zone for at least 48 hours after transition (the parent TTL), but not much longer. Actually, this is the method I have used most of the time; the old server admin just has to remember to remove the zones at a later date.

b) Ensure that the old DNS servers allow recursive queries (at least for your zone and at least for 48 hours); note that servers that are "official" DNS servers for some zone typically do not allow recursive queries.

c) Before moving zones, change your local TTL for all records to 96hrs, say. Then wait 48hrs before doing the move. This way, resolvers should typically have a copy of your DNS records in cache that survives longer than the obsoleting NS records. This method is not perfect and becomes problematic especially if there are "cross-references" between domains or if there are records that are queried less often than the main records.

d) Alternatively, before moving zones decrease the parent TTL to 1hr (or to as much downtime as you deem acceptable), wait 48hrs and do the move. However, it may not be possible to change the TTL to such a low value at the parent zone (they don't want to be queried so often), and even if it is, you'd have to consider their zone update schedules.
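For option a), it helps to confirm that the old and the new servers hand out the same data while both are serving the zone. A minimal sketch, assuming `dig` is available; `same_answer` and all host names below are hypothetical placeholders.

```shell
# Hypothetical helper: succeed if two servers return the same record set.
# Usage: same_answer <name> <type> <old-server> <new-server>
same_answer() (
    name=$1 type=$2 old=$3 new=$4
    # +norecurse: ask each server for its own data; sort to ignore ordering.
    a=$(dig +norecurse +short "$type" "$name" @"$old" | sort)
    b=$(dig +norecurse +short "$type" "$name" @"$new" | sort)
    # An empty answer from the old server counts as a failure.
    [ -n "$a" ] && [ "$a" = "$b" ]
)

# Example (needs network access; server names are placeholders):
# same_answer www.example.com A ns-old.example.net ns-new.example.net ||
#     echo "mismatch for www.example.com A"
```

Running this for each important record before the switch, and again during the 48-hour overlap, shows whether resolvers that still follow the old delegation get the same answers as those that follow the new one.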

I think a) is the main thing to take away from this and I would actually disagree with b), as this doesn't seem like it would really serve any purpose. Recursion only happens if the client requests it (by setting the RD bit in the query), the clients (resolver servers) sending queries to authoritative servers do not request recursion. — Håkan Lindqvist, Apr 01 '16 at 20:25
Setting up the old servers with slave zones during the transition would make much more sense. — Håkan Lindqvist, Apr 01 '16 at 20:35
(a)=> I asked. Enom declined to do so. (b)=> Unsure about this. I guess it has nothing to do with recursive queries. (c)=> This is exactly what I am left with (mentioned in the last para of my question). The exercise was not repeated for fear of losing traffic/interest. In fact, quite the opposite was done without my knowledge (i.e. the TTL for records was reduced to the minimum possible and then the NS was changed. It was a doom! The idea of reducing the TTL is to enable a quick recovery if things don't work :) ). (d)=> True! They declined to do that as well. — anup, Apr 27 '16 at 12:32
@HåkanLindqvist We have no access to the name servers. Our only access is a web-based tool where we can either retain the default (Enom) or add custom NS. If custom is selected, the default Enom NS is no longer authoritative. Catch-22 situation! — anup, Apr 27 '16 at 12:44
@anupo You also don't have the option to configure the default servers **plus** new servers? — Hagen von Eitzen, Apr 27 '16 at 15:20
