DNS lookups on some of my EC2 instances have been failing intermittently. A reboot resolves the problem, but the instance falls back into the same failed state after a few hours (or a few days) and stays there until the next reboot.

When the failure happened, I tried to resolve www.google.com using 8.8.8.8. The output is as follows:

# dig @8.8.8.8 www.google.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.5.2 <<>> @8.8.8.8 www.google.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

I ran tcpdump in parallel with dig. The capture shows the nameserver's response arriving on eth0, so I assume the OS is discarding the response after it reaches the interface.

# tcpdump -i eth0 udp and port 53 -vvv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:02:18.312658 IP (tos 0x0, ttl 254, id 36162, offset 0, flags [DF], proto UDP (17), length 76)

(I've removed additional lines from output)

    my_hostname.54159 > 8.8.8.8.domain: [udp sum ok] 12088+ [1au] A? www.google.com. ar: . OPT UDPsize=4096 (43)
07:03:29.274714 IP (tos 0x0, ttl 255, id 8454, offset 0, flags [DF], proto UDP (17), length 66)
    my_hostname.35356 > 10.210.148.199.domain: [udp sum ok] 28668+ PTR? 8.8.8.8.in-addr.arpa. (38)
07:03:29.277401 IP (tos 0x0, ttl 128, id 7424, offset 0, flags [DF], proto UDP (17), length 90)
    10.210.148.199.domain > my_hostname.35356: [udp sum ok] 28668 q: PTR? 8.8.8.8.in-addr.arpa. 1/0/0 8.8.8.8.in-addr.arpa. [5m] PTR dns.google. (62)
07:03:29.279305 IP (tos 0x0, ttl 115, id 5157, offset 0, flags [none], proto UDP (17), length 167)
    8.8.8.8.domain > my_hostname.54159: [udp sum ok] 12088 q: A? www.google.com. 6/0/1 www.google.com. [5m] A 172.253.122.104, www.google.com. [5m] A 172.253.122.106, www.google.com. [5m] A 172.253.122.99, www.google.com. [5m] A 172.253.122.103, www.google.com. [5m] A 172.253.122.105, www.google.com. [5m] A 172.253.122.147 ar: . OPT UDPsize=512 (139)

(I've removed additional lines from output)


^C
3547 packets captured
4276 packets received by filter
729 packets dropped by kernel

Running dig over TCP (+vc) works as expected:

# dig +tries=1 @8.8.8.8 www.google.com +short +vc
172.253.122.103
172.253.122.106
172.253.122.105
172.253.122.147
172.253.122.104
172.253.122.99
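
To repeat this UDP-versus-TCP comparison from a script (for example on a schedule, to catch the moment the failure starts), a minimal sketch using only the Python standard library could look like the following. The function names, the query ID and the timeout are illustrative assumptions of mine, not anything taken from the affected hosts.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def query_udp(server, payload, timeout=3.0):
    """Send the query over UDP; return the raw reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (server, 53))
        try:
            data, _ = sock.recvfrom(4096)
            return data
        except socket.timeout:
            return None


def query_tcp(server, payload, timeout=3.0):
    """Send the query over TCP (2-byte length prefix); return reply or None."""
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(struct.pack("!H", len(payload)) + payload)
            prefix = sock.recv(2)
            if len(prefix) < 2:
                return None
            (length,) = struct.unpack("!H", prefix)
            data = b""
            while len(data) < length:
                chunk = sock.recv(length - len(data))
                if not chunk:
                    break
                data += chunk
            return data
    except OSError:
        return None


def probe(server, name):
    """Return which transports produced a reply for the same query."""
    payload = build_query(name)
    return {
        "udp": query_udp(server, payload) is not None,
        "tcp": query_tcp(server, payload) is not None,
    }
```

Calling `probe("8.8.8.8", "www.google.com")` on a healthy host should report both transports as answering; on a host in the failed state described here I would expect UDP to time out while TCP still answers.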

I checked iptables. The service is inactive, and nothing in the rules suggests that it is at fault:

# iptables -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT

I also looked at a few more things about the network:

# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.210.151.34  netmask 255.255.255.128  broadcast 10.210.151.127
        ether 0e:6a:ac:cb:2a:f9  txqueuelen 1000  (Ethernet)
        RX packets 4529869  bytes 552070750 (526.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4475269  bytes 756543406 (721.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# netstat -suna
IcmpMsg:
    InType0: 25
    InType3: 1615
    OutType3: 1910
    OutType8: 33
Udp:
    367075 packets received
    1918 packets to unknown port received.
    466742 packet receive errors
    838332 packets sent
    0 receive buffer errors
    0 send buffer errors
UdpLite:
IpExt:
    InOctets: 928068155
    OutOctets: 1678778245
    InNoECTPkts: 7967209
    InECT0Pkts: 10918

# sysctl net.core.rmem_max
net.core.rmem_max = 16777216

# sysctl net.ipv4.udp_mem
net.ipv4.udp_mem = 382056   509411  764112

# uptime
 08:02:13 up 3 days, 18 min,  1 user,  load average: 0.00, 0.00, 0.00

Is the high "packet receive errors" count (466742, against 367075 packets received) a smoking gun?
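
One way to test that suspicion is to snapshot the kernel's UDP counters immediately before and after a single failing dig and see which counter moves. The sketch below parses the `Udp:` lines of `/proc/net/snmp`, which hold the same counters that `netstat -su` prints (`InErrors` corresponds to "packet receive errors", `RcvbufErrors` to "receive buffer errors"). The "before" sample reuses the numbers from my netstat output above; the "after" sample is hypothetical and only illustrates what a dropped reply might look like.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    # The file holds pairs of lines per protocol: a header line of counter
    # names followed by a line of values. "UdpLite:" does not match "Udp:".
    udp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_lines[0], udp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def counter_deltas(before, after):
    """Difference of every counter present in both snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


def read_udp_counters(path="/proc/net/snmp"):
    """Read the live counters (Linux only)."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())


# Values taken from the netstat output in this question.
SAMPLE_BEFORE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 367075 1918 466742 838332 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""

# HYPOTHETICAL snapshot after one failed dig: one query sent, one receive
# error counted, nothing delivered to a socket.
SAMPLE_AFTER = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 367075 1918 466743 838333 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""

deltas = counter_deltas(parse_udp_counters(SAMPLE_BEFORE),
                        parse_udp_counters(SAMPLE_AFTER))
for name, change in deltas.items():
    print(f"{name}: {change:+d}")
```

On the affected host, the two snapshots would come from `read_udp_counters()` called before and after the dig. If `InErrors` rises with each failed query while `RcvbufErrors` stays at 0, that would suggest the reply reaches the UDP layer and is rejected there for a reason other than a full receive buffer.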

How can I resolve the DNS lookup failure without a reboot?

SJH
  • Although I would assume that you then wouldn't see the packets in tcpdump either: in addition to a local host-based firewall, you may also have an external firewall solution active, either a "real" firewall or the logical equivalent, such as a Security Group. – Rob Jul 10 '22 at 11:52
  • Since a reboot corrects the issue, even if only temporarily, it would seem that this is an issue with the OS, no? I did get AWS support to troubleshoot, and they found nothing out of the ordinary in the AWS infrastructure or configuration that could be causing this – SJH Jul 10 '22 at 13:13