We're using BIND 9.7.3 on the stable version of Debian (updated weekly), and we see some very strange behaviour for one particular domain. We host a few hundred, but this one is ours.

Basically, the secondary DNS server is trying to transfer the domain from the master. According to the logs, it succeeds in transferring the domain every time, but it always gets the serial number wrong! As a result, it keeps re-doing the transfer at every opportunity. I'm not even sure where it's getting the serial number, because the primary server reports back with the right serial number.

Here are the logs we get from the secondary (192.168.0.130 is the primary server and 192.168.0.4 is the secondary; of course, these are not the real addresses):

Aug 23 03:01:08 ns2 named[4242]: transfer of 'mydomain.ca/IN/external' from 192.168.0.130#53: connected using 192.168.0.4#60959
Aug 23 03:01:08 ns2 named[4242]: transfer of 'mydomain.ca/IN/external' from 192.168.0.130#53: Transfer completed: 0 messages, 1 records, 0 bytes, 0.001 secs (0 bytes/sec)

This seems pretty normal, although both hosts are set up with IPv6 addresses and technically speaking they should be using them, but that's a problem for another day (I think).

So let's query the primary server from the secondary one, and see what it says:

$ host -4 -t any mydomain.ca 192.168.0.130
Using domain server:
Name: 192.168.0.130
Address: 192.168.0.130#53
Aliases: 

mydomain.ca has IPv6 address fc00::31
mydomain.ca has SOA record ns1.mydomain.bc.ca. hostmaster.mydomain.ca. 2011082201 900 3600 604800 86400
mydomain.ca name server ns2.mydomain.bc.ca.
mydomain.ca name server ns1.mydomain.bc.ca.
mydomain.ca mail is handled by 20 pop.mydomain.ca.
mydomain.ca has address 192.168.0.205
mydomain.ca descriptive text "v=spf1 mx ip4:192.168.0.4 ip4:192.168.0.193 ip6:fc00::23 ip6:fc00::12 ip6:fc00::33 a:smtp.mydomain.ca a:webmail.mydomain.ca a:smtp2.mydomain.ca a:ns2.mydomain.ca ~all"

And then let's do the same for the secondary nameserver:

$ host -4 -t any mydomain.ca 192.168.0.4  
Using domain server:
Name: 192.168.0.4
Address: 192.168.0.4#53
Aliases: 

mydomain.ca has SOA record ns1.mydomain.bc.ca. hostmaster.mydomain.ca. 2011011013 600 600 600 600
mydomain.ca descriptive text "v=spf1 mx ip4:192.168.0.4 ip4:192.168.0.193 ip6:fc00::23 ip6:fc00::12 ip6:fc00::33 a:smtp.mydomain.ca a:webmail.mydomain.ca a:smtp2.mydomain.ca a:ns2.mydomain.ca ~all"
mydomain.ca has address 192.168.0.205
mydomain.ca mail is handled by 20 pop.mydomain.ca.
mydomain.ca name server ns1.mydomain.bc.ca.
mydomain.ca name server ns2.mydomain.bc.ca.
mydomain.ca has IPv6 address fc00::31

You can see here that the serial number is 2011011013 on the secondary, but 2011082201 for the primary. I've used the date plus a 2-digit number, so the secondary is somehow using a serial number from January. I've tried searching our configuration on both the primary and secondary servers for this serial number, but it's nowhere to be found.
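The comparison described above can be scripted so it is easy to repeat after each change. This is only a sketch: the function names `soa_serial` and `compare_serials` are made up for illustration, and the comparison treats serials as plain integers, ignoring the wraparound rules of RFC 1982 serial arithmetic.

```shell
# Pull the SOA serial out of `host`-style output read from stdin.
# In a line such as
#   "... has SOA record MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM"
# the serial is the fifth field from the end.
soa_serial() {
    awk '/ has SOA record / { print $(NF-4); exit }'
}

# Compare two serials as plain integers (no RFC 1982 wraparound handling).
# $1 = serial from the primary, $2 = serial from the secondary
compare_serials() {
    if [ "$1" -eq "$2" ]; then
        echo "in sync"
    elif [ "$1" -gt "$2" ]; then
        echo "secondary is behind"
    else
        echo "secondary is ahead"
    fi
}
```

Feeding it the output of `host -4 -t soa mydomain.ca 192.168.0.130 | soa_serial` and the same command against 192.168.0.4 should, with the values shown above, report that the secondary is behind.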

Speaking of configuration, here's the configuration for this domain in /etc/bind/named.conf:

zone "mydomain.ca" { type slave; file "secondaries/mydomain.ca"; masters { 192.168.0.130; }; };

and the timestamp on secondaries/mydomain.ca is the time of the most recent update. Deleting this file still results in a serial number of 2011011013. The contents of this file are very long, but here are the headers on the secondary server:

$ORIGIN .
$TTL 3600   ; 1 hour
mydomain.ca     IN SOA  ns1.mydomain.bc.ca. hostmaster.mydomain.ca. (
            2011011013 ; serial
            600        ; refresh (10 minutes)
            600        ; retry (10 minutes)
            600        ; expire (10 minutes)
            600        ; minimum (10 minutes)
            )
        NS  ns1.mydomain.bc.ca.
        NS  ns2.mydomain.bc.ca.
        A   192.168.0.205
        MX  20 pop.mydomain.ca.
        TXT "v=spf1 mx ip4:192.168.0.4 ip4:192.168.0.193 ip6:fc00::23 ip6:fc00::12 ip6:fc00::33 a:smtp.mydomain.ca a:webmail.mydomain.ca a:smtp2.mydomain.ca a:ns2.mydomain.ca ~all"
        AAAA    fc00::31
$ORIGIN mydomain.ca.

and the headers from the equivalent file on the primary:

$TTL 1d
@       IN      SOA     ns1.mydomain.bc.ca. hostmaster.mydomain.ca. (
                    2011082302 ; serial
                    15m        ; refresh after 15 minutes
                    1h         ; retry after 1 hour
                    1w         ; expire after 1 week
                    1d )       ; negative caching TTL of 1 day.

    IN      NS      ns1.mydomain.bc.ca.
    IN      NS      ns2.mydomain.bc.ca.
    IN MX   20      pop.mydomain.ca.


@               IN      A       192.168.0.205

;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; SPF TXT records
;;;;;;;;;;;;;;;;;;;;;;;;;;;

mydomain.ca. TXT "v=spf1 mx ip4:192.168.0.4 ip4:192.168.0.193 ip6:fc00::23 ip6:fc00::12 ip6:fc00::33 a:smtp.mydomain.ca a:webmail.mydomain.ca a:smtp2.mydomain.ca a:ns2.mydomain.ca ~all"

; this next bit is for the Sender Policy Framework, if it ever really matters.
pop             TXT     "v=spf1 a -all"
pop3            TXT     "v=spf1 a -all"
smtp            TXT     "v=spf1 a -all"
webmail         TXT     "v=spf1 a -all"
horde           TXT     "v=spf1 a -all"
Shane Madden
Ernie
  • When you tried deleting the secondary's copy of the zone, did you stop the secondary server, delete it & restart? (That should force a full `AXFR` instead of just an IXFR) – voretaq7 Aug 23 '11 at 16:41
  • Yes, repeatedly. – Ernie Aug 23 '11 at 16:51
  • I might be tempted to fire up tcpdump on the secondary DNS server, delete the zone and restart bind. Load up the capture and see if it gets the zone from the correct place, and if it has valid data. – Zoredache Aug 23 '11 at 17:10
  • I've seen behavior like this when there are jnl files related to the domain. Have you already checked for those? http://www.isc.org/files/arm94_0.html#journal – polynomial Aug 24 '11 at 06:22
  • It seems odd that the log on the secondary server reports a successful transfer of 0 bytes. That doesn't sound successful at all... – NorbyTheGeek Oct 12 '11 at 20:14
  • Do you see the transfer on the master? If so, what does it say? Are you using split-horizon (where the answer depends on where the request comes from)? Can you reach the master from the slave with ssh? Maybe there is a duplicate IP. The reason for the frequent refetches is the 10-minute expiry. HTH – AndreasM Oct 22 '11 at 10:39
  • Do you maybe have any views defined on your primary? Like an internal view, that would be used when the primary is being accessed via private class C IP. Can you query the server via public IP address? Does it render different results? – al. Oct 25 '11 at 09:15
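Several of the comments above come down to checking what the master actually hands out when the secondary asks for the whole zone. One way to do that without a packet capture is to request the transfer by hand with `dig` from the secondary and read the serial from the first SOA record. The helper name `axfr_serial` is hypothetical, and this assumes the master permits zone transfers to the host you run it from.

```shell
# Print the serial of the first SOA record in `dig` AXFR output read from stdin.
# Record lines look like: NAME TTL CLASS SOA MNAME RNAME SERIAL ...
# dig's comment lines (starting with ";") are not expected to have "SOA"
# as their fourth field, so they fall through.
axfr_serial() {
    awk '$4 == "SOA" { print $7; exit }'
}
```

For example, `dig @192.168.0.130 mydomain.ca AXFR | axfr_serial`, run on the secondary. If views are in play, adding `-b 192.168.0.4` sets the source address of the query, which may change which view answers. A serial that differs from the one `host` reports would suggest the transfer is being served from a different view or a different machine.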

2 Answers

Check the permissions on the secondary server's zone directory. Can the named process write to that folder? Try deleting the secondary's copy of the zone and letting the transfer recreate it.
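A quick way to test this is a small check run as the account that named runs under. The function name `can_write_dir` is made up for illustration, and the directory to pass is whatever your `secondaries/` path resolves to on your system.

```shell
# Report whether the current user can create files in a directory.
# $1 = directory to check
can_write_dir() {
    if [ -d "$1" ] && [ -w "$1" ] && [ -x "$1" ]; then
        echo "writable: $1"
    else
        echo "NOT writable: $1"
        return 1
    fi
}
```

Running it as your own user or as root tells you little; the result only matters for the user the named process actually runs as.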

S. Cobbs

Notice that your BIND log on the secondary indicates

... Transfer completed: 0 messages, 1 records, 0 bytes, 0.001 secs (0 bytes/sec)

That's not a success confirmation. Is there any message indicating an error before this?
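To see how often this happens, the log can be filtered for completion lines that report zero bytes. This is a rough sketch: `empty_transfers` is a hypothetical name, and the pattern is based only on the log line format quoted in the question.

```shell
# Print transfer-completion log lines (read from stdin) that report zero bytes.
# The ", 0 bytes," anchor keeps counts such as "10 bytes" from matching.
empty_transfers() {
    grep 'Transfer completed: .* records, 0 bytes,'
}
```

Feed it whichever file named logs to on your system, for example `empty_transfers < /var/log/syslog` (the path is an assumption), then look at the lines logged just before each hit.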

michele