Caching, forwarding Bind 9.9.4 server works for weeks, suddenly SERVFAIL on all queries (rebooting fixes it)

Question

I have BIND 9.9.4 running on two servers (CentOS 6 and 7), caching and forwarding DNS queries for a mail server. The servers run fine for weeks, then suddenly respond to all queries with SERVFAIL. The first time it happened, both servers started failing on the same day. Now, a week later, it has happened again, but only on one server. Restarting named makes the problem go away.

Here are the important bits of /etc/named.conf (full file with irrelevant bits here):

acl "trusted" {
    localhost;
    localnets;
    10.128.0.0/9;
};
options {
    listen-on port 53 { 127.0.0.1; 10.128.0.0/9; };
    listen-on-v6 port 53 { ::1; };
    directory               "/var/named";
    dump-file               "/var/named/data/cache_dump.db";
    statistics-file         "/var/named/data/named_stats.txt";
    memstatistics-file      "/var/named/data/named_mem_stats.txt";
    bindkeys-file           "/etc/named.iscdlv.key";
    managed-keys-directory  "/var/named/dynamic";
    auth-nxdomain no;
    version "asdf";

    dnssec-enable       yes;
    dnssec-validation   yes;
    dnssec-lookaside    auto;

    recursion yes;
    forward only;
    forwarders { 169.254.169.254; };

    allow-query     { trusted; };
    allow-recursion { trusted; };
};

When the server is in a failing state, a dig query gets this response:

[q@oak3] dig @10.128.0.9 apple.com a

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6_10.1 <<>> @10.128.0.9 apple.com a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 44811
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;apple.com.         IN  A

;; Query time: 3 msec
;; SERVER: 10.128.0.9#53(10.128.0.9)
;; WHEN: Fri Mar 15 19:22:06 2019
;; MSG SIZE  rcvd: 27

These log entries appear:

==> /var/named/chroot/var/log/queries.log <==
15-Mar-2019 19:22:06.983 client 10.128.0.4#55092 (apple.com): query: apple.com IN A + (10.128.0.9)

==> /var/named/chroot/var/log/dnssec.log <==
15-Mar-2019 19:22:06.984 validating apple.com/A: bad cache hit (com/DS)

==> /var/named/chroot/var/log/lame-servers.log <==
15-Mar-2019 19:22:06.984 broken trust chain resolving 'apple.com/A/IN': 169.254.169.254#53

After restarting named, running the same query (dig @10.128.0.9 apple.com a) responds correctly, and there are no errors in the log.

Nothing relevant was logged under /var/log at the time the queries began failing. The server hasn't rebooted recently, and no updates were installed recently.
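One way to pin down when the failures began is to look for the first "broken trust chain" entry in lame-servers.log. The following is only a minimal shell sketch: the function names `first_failure` and `count_failures` are hypothetical, and it assumes the log has not been rotated since the failures started and that each entry begins with a date and a time field, as in the excerpts above.

```shell
# first_failure LOGFILE: print the date and time of the first
# "broken trust chain" entry in LOGFILE (prints nothing if there is none).
first_failure() {
    awk '/broken trust chain/ { print $1, $2; exit }' "$1"
}

# count_failures LOGFILE: print how many "broken trust chain" entries
# LOGFILE contains.
count_failures() {
    grep -c 'broken trust chain' "$1"
}
```

For example, `first_failure /var/named/chroot/var/log/lame-servers.log` would print the timestamp of the earliest matching entry, which can then be compared against the other logs.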

Is there any issue with my configuration? What may cause a normally-functioning bind server to suddenly start failing?

Comments

Did you try having more logs? Did you try another BIND version? Also, remove every trace of DLV from your configuration, since this feature is long gone (https://www.isc.org/downloads/bind/dlv/): remove `dnssec-lookaside auto;` and make sure that `/etc/named.iscdlv.key` has the same content as https://ftp.isc.org/isc/bind9/keys/9.11/bind.keys.v9_11. "broken trust chain resolving" may mean a DNSSEC issue. When dig fails with SERVFAIL, does it work when you add `+cd`? If so, that is 100% a DNSSEC problem. You will probably need to look at the log files on your forwarder too. - Patrick Mevzek, Apr 17 '19 at 23:13
What is the size of /var/named/chroot/var/log/queries.log when the server crashes? What is the output of df -h when the server crashes? Also, running strace when the server crashes can give clues. - bgtvfr, Apr 19 '19 at 08:24
@bgtvfr the server does not crash; it returns SERVFAIL, which can happen for a multitude of reasons, but here I am pretty sure it is because of DNSSEC. Unfortunately, we have not heard more from the OP with added details. - Patrick Mevzek, Apr 20 '19 at 19:40
@PatrickMevzek thanks, good tips. What did you mean by "Did you try having more logs"? I'm already using `severity dynamic`, so I don't think I can juice more logs out of it (see the link to my config in my post). I'm waiting for it to SERVFAIL again, and then I'll try your other suggestions. - Quinn Comendant, Apr 24 '19 at 05:14
Probably unrelated to your problem, but BIND 9.9.4 has been EOL for almost a year, like all 9.9 and 9.10 releases; see https://kb.isc.org/docs/bind-9-end-of-life-dates. Unless you have compelling reasons not to, you should try to upgrade first... - Patrick Mevzek, Apr 24 '19 at 05:59
Regarding logging (I did not look at the configuration behind your remote link), look at https://ftp.isc.org/isc/bind9/cur/9.14/doc/arm/Bv9ARM.ch05.html#logging_statement or the example in https://stackoverflow.com/questions/11153958/how-to-enable-named-bind-dns-full-logging; you could use "severity debug 10" for plenty of debug output. But again, to summarize, I bet you have a DNSSEC problem, so I would either rule that out or confirm it before changing logging. - Patrick Mevzek, Apr 24 '19 at 06:07
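
The `+cd` check suggested in the comments can be scripted so that it is ready the next time the server starts failing. The following is a minimal shell sketch, not a procedure taken from the thread: the function names `dig_status`, `classify`, and `check_dnssec` are hypothetical, and it assumes `dig` is installed and prints its usual `status:` header line.

```shell
# dig_status: read dig output on stdin and print the response status
# (NOERROR, SERVFAIL, NXDOMAIN, ...) taken from the header line.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# classify PLAIN CD: compare the status of a normal query (PLAIN) with the
# status of the same query sent with +cd, i.e. checking disabled (CD).
classify() {
    case "$1:$2" in
        SERVFAIL:NOERROR) echo "validation-related" ;;   # fails only when validating
        SERVFAIL:*)       echo "not-only-validation" ;;  # fails even with +cd
        *)                echo "ok" ;;
    esac
}

# check_dnssec SERVER NAME: run both queries against SERVER and classify them.
check_dnssec() {
    plain=$(dig "@$1" "$2" A | dig_status)
    unchecked=$(dig "@$1" "$2" A +cd | dig_status)
    classify "$plain" "$unchecked"
}
```

For example, `check_dnssec 10.128.0.9 apple.com` queries the caching server from the question; a result of `validation-related` would be consistent with the DNSSEC explanation, while `not-only-validation` would point toward the forwarder or something else.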
