I have bind 9.9.5 running on two servers (CentOS 6 and 7), for caching and forwarding DNS queries for a mail server. The servers run fine for weeks, then suddenly respond to all queries with SERVFAIL. The first time it happened, both servers started failing on the same day. Now, a week later, it happened again, but only on one server. Restarting named makes the problem go away.

Here are the important bits of /etc/named.conf (full file with irrelevant bits here):

acl "trusted" {
    localhost;
    localnets;
    10.128.0.0/9;
};
options {
    listen-on port 53 { 127.0.0.1; 10.128.0.0/9; };
    listen-on-v6 port 53 { ::1; };
    directory               "/var/named";
    dump-file               "/var/named/data/cache_dump.db";
    statistics-file         "/var/named/data/named_stats.txt";
    memstatistics-file      "/var/named/data/named_mem_stats.txt";
    bindkeys-file           "/etc/named.iscdlv.key";
    managed-keys-directory  "/var/named/dynamic";
    auth-nxdomain no;
    version "asdf";

    dnssec-enable       yes;
    dnssec-validation   yes;
    dnssec-lookaside    auto;

    recursion yes;
    forward only;
    forwarders { 169.254.169.254; };

    allow-query     { trusted; };
    allow-recursion { trusted; };
};

When the server is in a failing state, a dig query returns this response:

[q@oak3] dig @10.128.0.9 apple.com a

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6_10.1 <<>> @10.128.0.9 apple.com a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 44811
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;apple.com.         IN  A

;; Query time: 3 msec
;; SERVER: 10.128.0.9#53(10.128.0.9)
;; WHEN: Fri Mar 15 19:22:06 2019
;; MSG SIZE  rcvd: 27

These log entries appear:

==> /var/named/chroot/var/log/queries.log <==
15-Mar-2019 19:22:06.983 client 10.128.0.4#55092 (apple.com): query: apple.com IN A + (10.128.0.9)

==> /var/named/chroot/var/log/dnssec.log <==
15-Mar-2019 19:22:06.984 validating apple.com/A: bad cache hit (com/DS)

==> /var/named/chroot/var/log/lame-servers.log <==
15-Mar-2019 19:22:06.984 broken trust chain resolving 'apple.com/A/IN': 169.254.169.254#53

After restarting named, running the same query (dig @10.128.0.9 apple.com a) responds correctly, and there are no errors in the log.

Nothing relevant was logged under /var/logs at the time that queries began failing. The server hasn't rebooted recently, and no updates were installed recently.

Is there any issue with my configuration? What may cause a normally-functioning bind server to suddenly start failing?

Quinn Comendant
  • Did you try having more logs? Did you try another bind version? Also remove any trace of DLV in your configuration, this feature is long gone (https://www.isc.org/downloads/bind/dlv/) hence remove `dnssec-lookaside auto;` and make sure that `/etc/named.iscdlv.key` has the same content as https://ftp.isc.org/isc/bind9/keys/9.11/bind.keys.v9_11. "broken trust chain resolving" may mean a DNSSEC issue. When dig fails with SERVFAIL, does it work when you add `+cd`? If so, that is 100% a DNSSEC problem. You will probably need to look at logfiles on your forwarder too. – Patrick Mevzek Apr 17 '19 at 23:13
  • What's the size of /var/named/chroot/var/log/queries.log when the server crashes? What's the output of df -h when the server crashes? Also, using strace when the server crashes can give clues. – bgtvfr Apr 19 '19 at 08:24
  • @bgtvfr the server does not crash; it returns SERVFAIL, which can happen for a multitude of reasons, but here I am pretty sure this is because of DNSSEC. Unfortunately we did not hear more from the OP with added details. – Patrick Mevzek Apr 20 '19 at 19:40
  • @PatrickMevzek thanks, good tips. What did you mean by "Did you try having more logs"? I'm already using `severity dynamic` – I don't think I can juice more logs out of it (see link to my config in my post). I'm waiting for it to SERVFAIL again, and I'll try your other suggestions. – Quinn Comendant Apr 24 '19 at 05:14
  • Probably unrelated to your problem, but bind 9.9.4 has been EOL for almost a year, like all 9.9 and 9.10 releases; see https://kb.isc.org/docs/bind-9-end-of-life-dates. Unless you have compelling reasons not to, you should try to upgrade first. – Patrick Mevzek Apr 24 '19 at 05:59
  • Regarding logging (I did not look at your configuration in a remote link), look at https://ftp.isc.org/isc/bind9/cur/9.14/doc/arm/Bv9ARM.ch05.html#logging_statement or the example in https://stackoverflow.com/questions/11153958/how-to-enable-named-bind-dns-full-logging; you could have "severity debug 10" for plenty of debug. But again, to summarize, I bet you have a DNSSEC problem, so I would either rule that out or confirm that before changing logging. – Patrick Mevzek Apr 24 '19 at 06:07
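
The `+cd` check suggested in the comments could be scripted roughly as below, so it can be run the next time the resolver starts returning SERVFAIL. This is only a sketch: the function names `dig_status`, `classify_dnssec` and `check_resolver` are made up for illustration, and the classification assumes that a SERVFAIL which turns into NOERROR once checking is disabled points at DNSSEC validation, as the comments argue.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Extract the status (e.g. NOERROR, SERVFAIL) from dig output read on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Compare the status of a normal query ($1) with the status of the same
# query sent with +cd, i.e. with DNSSEC checking disabled ($2).
classify_dnssec() {
    normal=$1
    unchecked=$2
    if [ -z "$normal" ]; then
        echo "no response"
    elif [ "$normal" = "SERVFAIL" ] && [ "$unchecked" = "NOERROR" ]; then
        echo "validation failure"
    elif [ "$normal" = "SERVFAIL" ]; then
        echo "failure not specific to validation"
    else
        echo "no failure"
    fi
}

# Query server $1 for the A record of name $2, with and without +cd.
check_resolver() {
    server=$1
    name=$2
    normal=$(dig "@$server" "$name" A | dig_status)
    unchecked=$(dig "@$server" "$name" A +cd | dig_status)
    classify_dnssec "$normal" "$unchecked"
}
```

With the addresses from the question, the call would be `check_resolver 10.128.0.9 apple.com`. A result of "validation failure" would support the DNSSEC theory; any other result would suggest looking at the forwarder or the logs instead.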

0 Answers