0

I have a problem with my DNS setup and I really can't understand what's wrong with it.

We host our own public DNS servers on 2 different networks (technically, we have 3 DNS servers on 3 different IP ranges but only 2 are in different physical locations).

This weekend, the primary DNS server for one of our domains hung (I have no idea why yet, but that's a different matter). Strangely, this caused all external DNS requests for that domain to fail. Now, it's always been my impression that the whole point of providing multiple DNS servers was that if one failed, the others would take over (or rather, that clients would query the next available NS server if the one listed in the SOA failed). Yet, until we restarted the primary server, no query would succeed, even though all the other DNS servers were up, running and answering queries properly (authoritative answers to all requests for the zone).

I've checked that the SOA is correct, that all DNS servers have properly registered glue records, and that all of them respond to DNS queries for the domains they are authoritative for.

Any ideas?

weeheavy
Stephane

3 Answers

3

DNSstuff is reporting that not all of your name servers have glue records, which I believe would have caused the problem.

joeqwerty
  • He has 3 primary nameservers. The one within his own domain has glue records. So this is normal and should not cause any problems. – Joris Sep 13 '10 at 19:57
  • Unless it's the one that was down. – joeqwerty Sep 14 '10 at 01:56
  • I have checked it and, as far as I can tell, there is a glue record for each of the name servers involved. I've removed the ".com" name server in order to simplify the diagnosis, though: having the glue records on different TLD servers can cause trouble with the diagnostic tools (at least, it did for me). Could you possibly check it again? – Stephane Sep 14 '10 at 12:38
  • Now it looks good. As I stated in my answer, if the previous glue records at the parent servers only pointed to the server that was down then there would be no way for DNS clients to query your other name servers because they wouldn't be able to resolve your other name servers names to query them. Glue records are required when you set the name servers of a domain name to a hostname under the domain name itself. – joeqwerty Sep 14 '10 at 12:50
  • Thanks for checking. I haven't added any glue records, only removed the NS server that wasn't in the same TLD. The missing glue record was most likely a bug in the diagnostic tool. – Stephane Sep 14 '10 at 14:55
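
The glue rule discussed in the comments above (glue is needed when a name server's hostname sits under the domain it serves) can be sketched as a label comparison. This is only an illustration: the function name `needs_glue` and the `example.net` / `example.com` names are hypothetical, not the asker's real zone.

```python
def needs_glue(zone, nameserver):
    """Return True if `nameserver` is the zone name itself or a name beneath it.

    Such a name server can only be found by asking the zone it serves,
    so the parent zone has to publish its address as a glue record.
    """
    zone_labels = zone.lower().rstrip(".").split(".")
    ns_labels = nameserver.lower().rstrip(".").split(".")
    # Compare whole labels from the right so "badexample.net" does not match "example.net".
    return ns_labels[-len(zone_labels):] == zone_labels


# Hypothetical names, for illustration only.
print(needs_glue("example.net", "ns1.example.net"))  # True: under the zone itself
print(needs_glue("example.net", "ns3.example.com"))  # False: resolved through another zone
```

A name server in another TLD is resolved through that TLD's own delegation, which may be why a tool that only inspects one parent zone reports its glue as missing.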
1

If they both respond to queries in a normal situation and B stops responding if A is unreachable, it sounds like B is configured as a forward resolver for that domain.

I believe you can examine the TTL of a record in the domain to identify this situation. Truly authoritative nameservers (that'd be A in your case) answer with the configured TTL every time, while B will probably return a TTL that counts down with every second.

Using dig, you can direct queries to a specific server with the "@ip.addr.of.nameserver" parameter, and the TTL will show up in the answer: "yourdomain.tld. 300 IN A an.ip.add.ress". On Windows, you'd need to consult the nslookup manual.
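
To make the TTL comparison concrete, here is a minimal sketch that reads the TTL out of two dig answer lines taken a few seconds apart from the same server. The helper names `parse_ttl` and `looks_cached` are made up for this example, and 192.0.2.10 is a documentation address standing in for the real record.

```python
import re

# Matches a dig answer line such as "yourdomain.tld. 300 IN A 192.0.2.10".
ANSWER_RE = re.compile(r"^(\S+)\s+(\d+)\s+IN\s+(\S+)\s+(.+)$")


def parse_ttl(answer_line):
    """Return the TTL (in seconds) from one dig answer line."""
    match = ANSWER_RE.match(answer_line.strip())
    if match is None:
        raise ValueError(f"not a dig answer line: {answer_line!r}")
    return int(match.group(2))


def looks_cached(first_line, later_line):
    """Return True if the TTL dropped between two queries to the same server.

    An authoritative server repeats the configured TTL on every answer;
    a caching or forwarding server counts it down.
    """
    return parse_ttl(later_line) < parse_ttl(first_line)


first = "yourdomain.tld. 300 IN A 192.0.2.10"
later = "yourdomain.tld. 287 IN A 192.0.2.10"
print(looks_cached(first, later))  # True: the TTL is counting down
print(looks_cached(first, first))  # False: same TTL every time
```

Paste in the answer lines from two runs of dig against B; if the second TTL is lower, B is answering from a cache rather than from its own copy of the zone.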

The cause and resolution of this problem are specific to the DNS software, so please tell us which software you use and its version number.

gravyface
Joris
  • Thanks for your answer. No, B isn't a forwarder for A. A is the primary server for the zone and B (and C) are secondaries. The zone files are updated through a regular zone transfer mechanism after a push notification. The software we use is SimpleDNS plus (5.2 Build 117). – Stephane Sep 13 '10 at 12:12
  • Does B stop immediately after A is unreachable, or only after some time? Is there anything in the event log for SimpleDNS? – Joris Sep 13 '10 at 12:59
  • B doesn't stop: it still answers just fine. What happens is that external clients cannot resolve anything in the zone any more. It's instantaneous, and it's resolved immediately when A is restored. And no, there isn't anything in the DNS logs of either A or B (not even a log of client queries on B). – Stephane Sep 13 '10 at 13:44
  • I'd be looking into the network communication between both servers while you're firing off queries to B. There is not supposed to be any realtime communication after the initial zone transfers; B should not know whether A is up or down. – Joris Sep 13 '10 at 14:25
  • I am NOT firing queries to B. The clients are supposed to, but they don't. B is a secondary DNS server for the zone: it's listed as an NS server for the zone but it's not in the SOA record. It's a simple, classical, standard secondary server. No forwarder, etc. – Stephane Sep 14 '10 at 09:44
  • Stephane: so what you're saying now is that while A is down B is fully able to answer queries, but clients do not use it? – Joris Sep 15 '10 at 06:00
  • @Joris: yes, that's what I mean – Stephane Sep 15 '10 at 07:12
  • @Stephane: sorry, I'm at a loss. Your nameservers are responding, the SOA looks correct. I presume you can see no queries during normal operation on B (= git?) either? – Joris Sep 15 '10 at 08:59
  • @Joris: Yes, that's what I'm seeing. No client ever queries the secondary servers. – Stephane Sep 15 '10 at 12:42
  • @Stephane, I'm completely blank at this point. Can you see the queries arrive when you (or I) explicitly query the second server? I'm thinking we're looking at this entirely the wrong way. – Joris Sep 15 '10 at 13:53
  • @Joris: Yes, I can see the queries when I explicitly query the second server. – Stephane Sep 16 '10 at 06:40
  • @Stephane: I suggest you re-phrase the question and re-post. Something along the lines of "Secondary DNS is never naturally queried by the internet, but works fine explicitly" – Joris Sep 16 '10 at 12:36
  • @Joris: Ok, thanks for the suggestion. I will do that. And sorry for the confusion. – Stephane Sep 16 '10 at 13:14
1

How long was your primary server unavailable? Your SOA expire setting is fairly low (86400 seconds, or 24 hours), so if the primary was offline longer than that, the secondaries would have expired the zone and stopped answering for it, causing queries to fail.

I generally recommend expire times of seven days (604800 seconds) to allow sufficient time to fix a failed primary server, especially if nobody is on call to fix it over long holiday weekends which can run 3-4 days before someone even realizes it's broken.
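
As a rough illustration of the arithmetic, here is a small sketch comparing the two expire values for a primary that stays down over a three-day weekend. The function name `zone_expired` is made up for this example, and it simplifies by treating the outage length as the time since the secondary's last successful refresh (the expire timer actually runs from that refresh, so the real margin is a bit smaller).

```python
HOUR = 3600
DAY = 24 * HOUR


def zone_expired(expire_seconds, seconds_since_last_refresh):
    """Return True once a secondary would stop serving the zone.

    A secondary keeps answering from its copy until `expire_seconds`
    have passed without a successful refresh from the primary.
    """
    return seconds_since_last_refresh >= expire_seconds


outage = 3 * DAY  # hypothetical long-weekend outage

print(zone_expired(86400, outage))     # True: a 24-hour expire has run out
print(zone_expired(604800, outage))    # False: a 7-day expire still covers it
print(zone_expired(86400, 2 * HOUR))   # False: a 2-hour outage is well within 24 hours
```

By this measure, a short outage should not expire the zone on the secondaries even with the 24-hour setting; the longer value only matters once the primary is down for a day or more.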

Justin Scott
  • As an aside, your refresh time seems a bit high (86400), which is fine if you're using notify and things don't change all that often. I generally recommend an hour or three myself though. – Justin Scott Sep 13 '10 at 12:51
  • The server was unavailable for about 2 hours. You make a good point about the expiration, though: I'll look into it. Any idea why clients don't query the secondaries when the primary is down, though? – Stephane Sep 13 '10 at 12:55
  • Thanks again for your suggestions: I've changed the TTLs in the SOA. – Stephane Sep 13 '10 at 13:00