
I'm currently experiencing slow query responses on a specific interface on my nameserver. I'm running BIND on a physical server with a single network card. That card is used by the interface eth0 and also by the virtual interface eth0:1, both of which have addresses in the same subnet.

BIND is listening on all IPv4 interfaces and has only some very basic options set. There are no other performance- or network-related options set in any other included configuration file.

listen-on { any;};
listen-on-v6 port 53 { ::1; };
directory       "/var/named";
dump-file       "/var/named/data/cache_dump.db";
statistics-file "/var/log/named/named.stats";
memstatistics-file "/var/named/data/named_mem_stats.txt";
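
For completeness, these directives sit inside the options statement of named.conf, roughly like this (nothing else performance- or network-related is set):

options {
        // Listen on all IPv4 addresses, but only the loopback for IPv6
        listen-on { any; };
        listen-on-v6 port 53 { ::1; };

        directory          "/var/named";
        dump-file          "/var/named/data/cache_dump.db";
        statistics-file    "/var/log/named/named.stats";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
};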

When I query against the address on the primary interface eth0, I get a delayed response, normally around three seconds or more. This applies even when querying from the box itself against that address (not the loopback). When querying the other private IP address, assigned to the virtual interface eth0:1, there is no performance problem and the response always comes back in under a second.

Analysing performance statistics, the box doesn't appear to be under load and memory isn't being maxed out. I've also got another nameserver set up as a slave to this one, on the same network with a near-identical network setup bar addressing, and I have no performance problems querying its main interface (it also has a virtual interface with identical configuration). The zones I'm querying for are authoritative, so there is no delay from looking up the records elsewhere. I'm also able to confirm (identified through tcpdump) that the query is received almost instantly by the server regardless of where it originates, and that the delay occurs between the query being received and the response being sent.
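
For reference, the tcpdump check was along these lines (interface name assumed); comparing the timestamp of the incoming query with the timestamp of the outgoing answer shows where the delay sits:

# Capture DNS traffic on the slow interface with full timestamps;
# -nn skips name/port resolution so it doesn't add delays of its own
tcpdump -i eth0 -nn -tttt port 53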

If there's any information that would be useful to have, please leave a comment below rather than downvoting me for missing it in my post, and I'll happily provide any helpful details I can. Any suggestions on how best to troubleshoot a problem of this nature, or ideas on what the potential causes could be, would be very much appreciated.

BIND version is 9.3.6-P1-RedHat-9.3.6-25.P1.el5_11.11. I've recently updated to this, but I'm unsure whether these performance issues came about following the upgrade, or whether they existed prior to it.

EDIT: Dig output as requested. Removed domain name being queried and target server.

Also worth noting that sometimes the requests just time out completely. It's quite intermittent: occasional replies come back in under two seconds, but most take over three, with the occasional timeout as mentioned.

[root@hugh-host-01 ~]# dig REMOVED @REMOVED

; <<>> DiG 9.9.4-RedHat-9.9.4-38.el7_3 <<>> REMOVED @REMOVED
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52129
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;REMOVED.                      IN      A

;; ANSWER SECTION:
REMOVED.               5       IN      A       REMOVED

;; AUTHORITY SECTION:
REMOVED.               5       IN      NS      REMOVED.
REMOVED.               5       IN      NS      REMOVED.
REMOVED.               5       IN      NS      REMOVED.

;; ADDITIONAL SECTION:
REMOVED.           5       IN      A       REMOVED
REMOVED.           5       IN      A       REMOVED
REMOVED.           5       IN      A       REMOVED

;; Query time: 3633 msec
;; SERVER: REMOVED#53(REMOVED)
;; WHEN: Sat Jan 07 00:49:01 GMT 2017
;; MSG SIZE  rcvd: 155
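
Regarding the intermittent timeouts mentioned above: by default dig retries UDP queries, which can mask a dropped query as a slow answer, so it can help to test with a single attempt and a short timeout (these are standard dig options):

# One UDP attempt with a two-second timeout; a dropped query now shows up
# as "connection timed out" rather than a several-second response
dig REMOVED @REMOVED +tries=1 +time=2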

Thanks for your time,

Hugh

  • Are views involved? You mentioned that these servers are authoritative, but it still wouldn't hurt to see the `dig` output of your tests. – Andrew B Jan 06 '17 at 21:09
  • Hey Andrew, added the output of a dig query against one of the authoritative servers for a domain. Also added a bit of extra info to the initial post. Thanks for the response. Views are involved, but they were only introduced fairly recently and the problem was already present before that, so it's unlikely they are related. Also, as the other servers have the views too, that seems to reinforce my suspicions. – Hugh Jan 07 '17 at 00:57
  • Thanks. When you run the queries from the same server, are you using a name alias (such as an `/etc/hosts`) which maps to that IP? Or are you typing out the IP? – Andrew B Jan 07 '17 at 01:03
  • I'm specifying the domain name in full, so hugh.com for example, and running it against the private IP address assigned to the troublesome interface. I've not got access to the box at present, but there's no reason there should be any hosts entries for any of the domains I've done test look-ups against. Plus, as it's dig, I imagine it communicates over port 53 with the BIND service and has the information served from BIND's records – unless dig can override the specific server you declare to look up against with @. – Hugh Jan 07 '17 at 01:11
  • Sorry ignore my response, misunderstood. I'm specifying the IP address of the troublesome interface when running tests locally. – Hugh Jan 07 '17 at 01:13
  • Yeah, that's what I was after. (and you're right about `/etc/hosts` not mattering in that context anyway, I'm tired) That rules out issues with resolving the @target, and failed attempts to run queries against the IPv6 name associated with the @target. That puts us between the socket and the software at that point. I assume nothing stands out in the logs? Do you see the counter for `netstat -su | grep errors` incrementing? – Andrew B Jan 07 '17 at 01:18
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/51343/discussion-between-andrew-b-and-hugh). – Andrew B Jan 07 '17 at 01:41

1 Answer


This problem was caused by iowait being maxed out on the server. It was running consistently at 100%, with kjournald as the process responsible.

Thanks to a suggestion from Andrew B, I began investigating UDP packet errors using netstat -su | grep errors. From this I could see the error counter shooting up by roughly 30-50 packets every second. That led me to check the per-socket UDP buffers by running netstat -uanp, from which I was able to confirm that the random delays and occasional timeouts (drops) were occurring because the receive buffer was full. I could tell the buffer was full from the value in the Recv-Q column for the BIND process listening on the IP / port in question.
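
For anyone retracing the steps, the checks were roughly the following (the grep patterns are assumptions; the interesting fields are the UDP "packet receive errors" counter and the Recv-Q column on named's socket):

# Run this a few times and watch the UDP error counters climb between runs
netstat -su | grep -i errors

# Per-socket state; a persistently large Recv-Q on named's UDP socket means
# queries are arriving faster than BIND is draining them
netstat -uanp | grep named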

After identifying that the buffer was full, there wasn't much point in increasing it, as it would no doubt become saturated again. Instead, as the CPU load and RAM looked okay, I started wondering whether disk operations could be creating a bottleneck for handling the UDP packets. This was confirmed by running the command top and analysing the iowait value.
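
Roughly what that check looks like with standard procps tools; the "wa" figure is the iowait percentage:

# One batch iteration of top; the "wa" value on the Cpu(s) line is iowait
top -b -n 1 | grep -i 'cpu(s)'

# Or sample it over time; vmstat's "wa" column reports the same figure
vmstat 2 5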

Once I'd identified that the CPU was waiting nearly 100% of the time for I/O operations to complete, I started using tools such as iotop to find out what was writing to disk. It turned out the journalling system for the ext3 file-system was generating all of the waiting. That led me to suspect that extreme amounts of logging on the server could be causing the saturation, as I knew the /var/log/messages file was receiving a ton of denied-query logs every second.
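
The iotop step was along these lines (flags assumed): -o shows only tasks actually doing I/O and -a accumulates totals, which makes kernel threads like kjournald stand out.

# Show only tasks doing I/O, with accumulated totals since start
iotop -o -a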

To test the above theory, I added the following line to named.conf within the logging section. It disables logging of the approval / denial messages generated for received queries; a log entry is written to /var/log/messages for every such query, which can be a lot if you're being barraged by clients:

category security { null; };
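
For context, that line goes inside the logging statement of named.conf; a minimal sketch (other channels and categories omitted):

logging {
    // Send security-category messages (query approval / denial) to the
    // built-in null channel so they never reach syslog or /var/log/messages
    category security { null; };
};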

Fortunately, after restarting BIND, I could see a dramatic drop in the iowait percentage. Testing queries, I was able to confirm that they are now answered well within a tenth of a second; a dramatic improvement on before.

In hindsight, I should have checked the iowait time first. I hope this helps anyone experiencing similar issues. I'll now be looking into controlling the logging a little more and seeing what I can do about these denied messages.
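
If the denied-query messages still need to be kept somewhere, a possible alternative to discarding them outright is to point the security category at a dedicated file channel rather than syslog; a sketch along these lines (file path, size cap and severity are assumptions):

logging {
    // Write security messages to their own size-capped log file so bursts
    // of denied queries can't flood /var/log/messages
    channel security_log {
        file "/var/log/named/security.log" versions 3 size 10m;
        severity info;
        print-time yes;
    };
    category security { security_log; };
};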

  • Whoops! Sorry for abandoning that chat, for some reason I thought I'd get notifications of activity. Glad to hear that you've fixed the problem though. :) – Andrew B Feb 27 '17 at 16:12
  • Thanks Andrew, no problem whatsoever. Appreciated the nudge in the right direction greatly. – Hugh Feb 28 '17 at 10:01