
I'm experiencing random issues on a DNS resolver running BIND 9.8.2 (RHEL6) behind a loadbalancer.

The loadbalancer actively probes the server with a DNS query to check whether it's alive, and every now and then the probe gets an ICMP "connection refused" (port unreachable) for port 53, which briefly makes the server unavailable for service in the loadbalancer's server pool.

This happens intermittently, for several minutes at a time, and then stops (the LB probes every 10 seconds, and we usually see 1 or 2 failures followed by successes). It mostly happens during peak load, but it has also been detected when traffic is at its lowest.

When the issue occurs, BIND is up and running, which makes the "connection refused" message confusing: I would expect that from a server with no service listening on port 53, which is not the case here. Even if DNS resolution failed for the probed record, the answer would be NXDOMAIN or SERVFAIL, not a flat-out UDP rejection.

The query logs don't show the failing probes, which suggests the UDP packets are being rejected before they are ever delivered to BIND for processing.
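
A packet capture on the resolver itself should show whether the kernel is the one generating those rejections; a minimal sketch (eth0 is a placeholder for the service interface):

# Watch the health-check queries and any ICMP port-unreachable messages the
# resolver sends back; correlate their timestamps with the LB failure events
tcpdump -ni eth0 'udp port 53 or icmp[icmptype] == icmp-unreach'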

My best guess is that this is caused by resource exhaustion of some sort, but I can't figure out what exactly. We're monitoring open file descriptors, net.netfilter.nf_conntrack_count, CPU, memory, recursive-clients, etc. and none of those indicators even approach the limits when the issue hits.
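
Two kernel-side counters that are easy to overlook with that kind of monitoring, as a rough sketch (port 53 appears as 0x0035 in the hex socket table):

# Per-socket drop counter for UDP sockets bound to port 53; the last column
# ("drops") counts datagrams discarded because the socket receive buffer was full
awk 'NR == 1 || $2 ~ /:0035$/' /proc/net/udp
# System-wide UDP counters; run this twice a few seconds apart and compare
# "packet receive errors" and RcvbufErrors to catch increments in real time
netstat -su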

None of the log files has any relevant error messages at the times when the issue occurs.

So, my question: what networking metrics/parameters should I look into?

Configuration

/etc/sysconfig/named

OPTIONS="-4 -n10 -S 8096 "

named.conf

options {
        directory "/var/named";
        pid-file "/var/run/named/named.pid";
        statistics-file "/var/named/named.stats";
        dump-file "/var/named/named_dump.db";
        zone-statistics yes;
        version "Not disclosed";
        listen-on-v6 { any; };
        allow-query { clients; privatenets; };
        recursion yes;                             // default
        allow-recursion { clients; privatenets; };
        allow-query-cache { clients; privatenets; };
        recursive-clients 10000;
        resolver-query-timeout 5;
        dnssec-validation no;
        //querylog no;

        allow-transfer { xfer; };
        transfer-format many-answers;
        max-transfer-time-in 10;
        notify yes;                                // default

        blackhole { bogusnets; };

        // avoid reserved ports
        avoid-v4-udp-ports { 1935; 2605; 4321; 6514; range 8610 8614; };
        avoid-v6-udp-ports { 1935; 2605; 4321; 6514; range 8610 8614; };

        max-cache-ttl 10800; // 3h

        response-policy {
                zone "rpz";
                zone "rpz2";
                zone "static";
        };
        rate-limit {
                window 2; // seconds rolling window
                ipv4-prefix-length 32;

                nxdomains-per-second 5;
                nodata-per-second 3;
                errors-per-second 3;
        };
};

Edit: UDP receive errors

$ netstat -su
IcmpMsg:
    InType3: 1031554
    InType8: 115696
    InType11: 178078
    OutType0: 115696
    OutType3: 502911
    OutType11: 3
Udp:
    27777696664 packets received
    498707 packets to unknown port received.
    262343581 packet receive errors
    53292481120 packets sent
    RcvbufErrors: 273483
    SndbufErrors: 3266
UdpLite:
IpExt:
    InMcastPkts: 6
    InOctets: 2371795573882
    OutOctets: 13360262499718
    InMcastOctets: 216

Edit 2: network memory sizing

$ cat /proc/sys/net/core/rmem_max
67108864
$ cat /proc/sys/net/ipv4/udp_mem
752928  1003904 1505856
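
For reference, net.ipv4.udp_mem is expressed in pages, so with 4 KiB pages the hard limit above works out to roughly 5.7 GiB; current usage can be compared against it directly:

# "mem" on the UDP line is also in pages; it would need to approach the third
# udp_mem value (1505856) before this limit starts causing packet drops
grep ^UDP: /proc/net/sockstat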

Edit 3: No issue with UdpInErrors

Had a probe failure event with no corresponding increase in UDP packet receive errors, so that seems to be ruled out as a cause.

Edit 4: there may be two issues at play here; some failure instances show corresponding UdpInErrors increases, and some don't

I've found an occurrence which correlates to a problem with UDP receive errors:

[Graphs: UDP in traffic; UDP errors]

I have already increased the kernel memory values to:

# cat /proc/sys/net/core/wmem_max
67108864
# cat /proc/sys/net/core/rmem_max
67108864
# cat /proc/sys/net/ipv4/udp_rmem_min
8192
# cat /proc/sys/net/ipv4/udp_wmem_min
8192
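
If these settings turn out to help, they only survive a reboot when persisted; a minimal sketch for /etc/sysctl.conf with the same values:

# /etc/sysctl.conf -- apply immediately with "sysctl -p"
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192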

It does not seem to be load-related: similar servers with the same or even bigger workloads show no issues, while the other server behind the same loadbalancer showed the exact same behaviour during that time period.

Edit 5: found some evidence of TCP issues in BIND's statistics-channels data

Found a correlation between the high UDP packet receive errors and the TCP4ConnFail and TCP4SendErr metrics from BIND's statistics-channels data.

[Graphs: TCP4ConnFail; TCP4SendErr]

While the increase is not on the same scale as the UdpInErrors, it correlates strongly with them, and the effect is not present at other times.
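
For completeness, this is roughly how those counters are being pulled; the statistics-channels block isn't shown in the named.conf above, so the 127.0.0.1:8053 listener is an assumption, and the XML layout varies between BIND versions:

# Pretty-print the statistics XML and extract the two TCP error counters
curl -s http://127.0.0.1:8053/ | xmllint --format - | grep -A1 'TCP4ConnFail\|TCP4SendErr'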

Edit 6: the plot thickens... syslog seems to be a contributing factor

The affected DNS servers are configured to log querylogs to syslog's local0 facility, which in turn forwards them via UDP to a remote server. This has been a cause of performance issues in the past; in fact, we don't have it enabled on the busiest DNS servers for that reason.
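
The forwarding rule itself isn't shown here, but it is the usual single-"@" UDP action; a sketch with a placeholder destination:

# /etc/rsyslog.conf -- "@" forwards over UDP (fire and forget), "@@" would use TCP
local0.*    @loghost.example.com:514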

I tried disabling querylogging and it seemed to make the udpInErrors issues disappear, so I set up an experiment. We have 2 servers behind the same loadbalancer: I kept querylogging active on serverA as a control and disabled querylogging (and thus syslog's forwarding) on serverB. The issue stopped occurring on both servers at the exact same time.

Then I re-enabled querylogging on serverB, and the issue reappeared, once again on both servers!

This was tried on 2 separate occasions, once under a light workload and once at the busiest time, and it only manifested at the busy time, so it seems to be at least partly load-related.
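
For the record, query logging can be flipped at runtime without touching named.conf; on BIND 9.8 the bare command toggles the current state (newer releases also accept an explicit on/off argument):

# Toggle query logging at runtime; check the result with rndc status,
# which reports "query logging is ON" or "query logging is OFF"
rndc querylog
rndc status | grep -i 'query logging'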

The graphs below show the relationship between network traffic and UDP receive errors.

[Graphs: serverB eth0 out; serverB udpInErrors]

[Graphs: serverA eth0 out; serverB udpInErrors]

The same increase effect can also be found in the TCP indicators mentioned in edit 5.

So the root cause seems to be networking-related, cascading into all the symptoms shown. The tricky part is that these servers (virtual machines, actually) run on separate infrastructure (different racks, at times even different rooms), so there should not be any effect propagation between them.

It doesn't seem to be related to querylogging itself; otherwise the effects wouldn't propagate between the servers.

One thing that occurred to me is that the receiving syslog server has no route back to these servers, so we never get any reply packets (if there are any), and syslog effectively works in a best-effort, "fire and forget" fashion. Could this be a cause?

André Fernandes
  • Does `netstat -su` show counts for "packet receive errors"? This counter corresponds to UDP-MIB::udpInErrors and may indicate receive buffer saturation (between the kernel and BIND). – Andrew B Jun 13 '17 at 16:31
  • I have found a few UDP packet receive errors in the output of `netstat -su` (updated the question with that detail). I am now tracing graphs for the `UDP-MIB::udpInErrors` metric. – André Fernandes Jun 14 '17 at 16:00
  • Excellent. That SNMP counter is a crucial monitoring metric for DNS servers, though it doesn't get brought up often. – Andrew B Jun 14 '17 at 16:38

1 Answer

Udp:
    27777696664 packets received
    ...
    262343581 packet receive errors

One UDP receive error for every 105.88 datagrams received (roughly 1% of inbound traffic) is an extremely high rate of DNS packet loss, and it indicates that you have a bottleneck in the software. If the receive queue associated with a UDP socket is full when the kernel tries to hand off a packet, the packet is dropped and this counter increments. That means you've lost over two hundred million packets to this phenomenon since the last reboot, on this server alone.
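
One way to see this happening live is to watch the receive queue on named's UDP sockets, as a sketch; a Recv-Q that keeps sitting near the buffer limit means the daemon is not draining the socket fast enough:

# Recv-Q is the number of bytes waiting in the socket receive buffer
# (run as root so -p can resolve the owning process)
watch -n1 'netstat -unlp | grep ":53 "'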

A full receive queue means that the software is not pulling packets off the socket fast enough, creating a backlog that eventually exceeds the size of the buffer. Your next step should be determining why the queue is running so high. As an example, one of our users found that the system was saturated with iowait due to heavy disk logging. I can't provide a comprehensive guide for identifying the root cause beyond suggesting that you review all of your SNMP performance metrics and look for a correlation. If you can't find any, you may be running at the full capacity of your hardware+software stack, and the problem is better solved by throwing more servers at it.
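
As a concrete example of that kind of correlation hunting, iowait is cheap to check on the spot (interval and sample count are arbitrary):

# "wa" is the percentage of CPU time spent waiting on disk I/O; sustained high
# values during the error bursts would point at the disk logging scenario above
vmstat 5 12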

As a side note, it's possible to increase the size of the receive queue, but that capacity is best viewed as your ability to absorb bursts of traffic. If burst traffic is not why you are experiencing queue spillage, a larger queue only means a longer processing delay for the traffic you do accept, which is probably undesired. BIND will request a receive queue depth of up to 32K, limited by the value of /proc/sys/net/core/rmem_max. That ceiling can be raised further by recompiling BIND with the --with-tuning=large option, which increases the potential maximum from 32K to 16M.
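
Whether a given build was configured that way can be checked without digging through the packaging, as a sketch; if the option predates the build or wasn't used, it simply won't appear:

# Look for the tuning option among the configure arguments reported by named -V
named -V | grep -o 'with-tuning=[a-z]*' || echo "with-tuning not set in this build"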

Andrew B
  • You bring up a few interesting points, which give me some ideas to check out, but I don't think the number is that high. The system uptime is now 191 days, this server processes between 500 and 3000 queries/second on a daily cycle, and does 50-200 outbound DNS queries. The amount of noise (bad domains, attacks, non-responding DNS servers) is also nontrivial, and this server has survived some full internet cutoff events, which may have contributed disproportionately to the error count. The fact is that it's not increasing regularly and the SNMP graphs for `udpInErrors` are at zero. – André Fernandes Jun 14 '17 at 16:47
  • Speaking as someone who operates several high-end caches nationwide, that number is pretty friggin' high. – Andrew B Jun 14 '17 at 16:48
  • sorry, jumped the gun there. I've added more context to the comment :P – André Fernandes Jun 14 '17 at 16:52
  • I understand where you're coming from, but those are factors for my machines as well, and I operate at MSO scale. To put things in perspective, your accept:drop ratio is 106:1. I don't run as much BIND in production as I used to, but mine is 943,399:1. 363,623:1 in a very large market on a non-BIND product. The numbers you are presenting do in fact represent a significant performance problem somewhere. – Andrew B Jun 14 '17 at 17:01
  • The 1% figure did seem high, but as I mentioned we've had a few catastrophic events since the last reboot. I'll have to track the graphs for a few days to get a better feeling of what is happening. Thanks for the insight. – André Fernandes Jun 14 '17 at 17:13
  • Ah, on reread I missed the "internet cutoff" part of your edited comment. That would weigh pretty heavily. Either way you're taking the right approach, if you continue to see the LB errors without this counter increasing they are probably unrelated. Good luck! – Andrew B Jun 14 '17 at 17:16