
I have a managed Kubernetes (AKS) cluster on Azure. I am confident that CoreDNS is working and the DNS pods are healthy.
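
For reference, one way this can be checked, assuming the standard k8s-app=kube-dns label that AKS puts on the CoreDNS pods:

kubectl get pods -n kube-system -l k8s-app=kube-dns          # CoreDNS pods should be Running and Ready
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20    # recent CoreDNS logs, no crash loops or errors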

I have a couple of services:

  1. frontend-service, with one pod - image [nginx-alpine], which serves the static frontend files.

  2. backend-service, with one pod - image [ubuntu:20.04], which runs the Node.js code.
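
For context, a minimal sketch of what such a Service might look like; the selector, labels and port are assumptions, not the actual manifests:

apiVersion: v1
kind: Service
metadata:
  name: frontend-service
spec:
  selector:
    app: frontend          # assumed pod label
  ports:
    - port: 80             # assumed service and container port
      targetPort: 80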

From the backend pods I am unable to resolve internal DNS service names such as frontend-service or frontend-service.default.svc.cluster.local, yet nslookup, host and dig of the same internal names resolve to the correct address. The backend pods are also able to resolve external DNS names such as google.com.

curl http://frontend-service
curl: (6) Could not resolve host: frontend-service

curl http://frontend-service.default.svc.cluster.local
curl: (6) Could not resolve host: frontend-service.default.svc.cluster.local
wget frontend-service
--2020-08-31 23:36:43--  http://frontend-service
Resolving frontend-service (frontend-service)... failed: Name or service not known.
wget: unable to resolve host address 'frontend-service'
/etc/nsswitch.conf shows the following:

passwd:         files
group:          files
shadow:         files
gshadow:        files

hosts:          files dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files
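
With hosts: files dns, programs that resolve through glibc (curl, wget) consult /etc/hosts and then DNS as configured in /etc/resolv.conf. That path can be exercised directly with getent, which goes through nsswitch.conf just like curl does, whereas nslookup, host and dig query the nameserver from /etc/resolv.conf directly:

getent hosts frontend-service                               # resolves via glibc/NSS, the same path curl uses
getent hosts frontend-service.default.svc.cluster.local     # fully qualified cluster name via the same path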

Resolving the backend-service internal DNS name from the frontend-service pods works fine.

After some debugging, looking at the CoreDNS logs and at strace output, I see that no request reaches the CoreDNS pods while doing a curl, but I can see the entry while doing an nslookup.

I also verified that /etc/resolv.conf has the correct configuration.

nameserver 10.3.0.10
search default.svc.cluster.local svc.cluster.local cluster.local tdghymxumodutbxfnz5m2elcog.bx.internal.cloudapp.net
options ndots:5

strace does not show any attempt to open /etc/resolv.conf, so curl is not reading /etc/resolv.conf.

Edit 1

From the backend-service pod:
dig frontend-service [the query reaches the correct name server, 10.3.0.10]


; <<>> DiG 9.16.1-Ubuntu <<>> frontend-service
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13441
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; OPT=65436: 87 a1 ee 81 04 d8 5a 49 be 0e c4 ed 1d d8 27 41 ("......ZI......'A")
;; QUESTION SECTION:
;frontend-service.            IN      A

;; AUTHORITY SECTION:
.                       30      IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2020083101 1800 900 604800 86400

;; Query time: 20 msec
;; SERVER: 10.3.0.10#53(10.3.0.10)
;; WHEN: Tue Sep 01 10:48:00 IST 2020
;; MSG SIZE  rcvd: 142
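
Note that dig does not apply the search domains from /etc/resolv.conf unless asked to, which is why the bare name returns NXDOMAIN above while nslookup and host (which do use the search list) succeed. To make dig expand the name with the search list:

dig +search frontend-service    # tries frontend-service.default.svc.cluster.local etc. against 10.3.0.10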

nslookup frontend-service

Server:         10.3.0.10
Address:        10.3.0.10#53

Name:   frontend-service.default.svc.cluster.local
Address: 10.3.0.30
host frontend-service     
frontend-service.default.svc.cluster.local has address 10.3.0.30

Edit 2

I wanted to test the deployment step by step with the same ubuntu:20.04 image, so I did the following.

Approach 1

I created an ephemeral pod in the cluster as below.

kubectl run -it --rm test-ubuntu --image=ubuntu:20.04 --restart=Never

Installed curl (7.68) and ran curl http://frontend-service; this succeeds.

This puzzled me, so I removed all my build steps from the Dockerfile and used only the commands below.

Approach 2

Dockerfile

FROM ubuntu:20.04
 
EXPOSE 3688
CMD [ "sleep", "infinity" ]

Pushed the image to ACR and deployed the backend pods again.
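
Roughly as follows; the registry name and manifest file here are placeholders, not the actual ones used:

docker build -t <acr-name>.azurecr.io/backend:test .     # build from the Dockerfile above
docker push <acr-name>.azurecr.io/backend:test           # push to the Azure Container Registry
kubectl apply -f backend-deployment.yaml                 # redeploy the backend pods with the new image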

kubectl exec -it <pod-name> /bin/bash

I installed curl (7.68) and ran curl http://frontend-service; same error, unable to resolve host.

This is surprising: the same image with the same content, running via kubectl run versus deployed through the Dockerfile, behaves differently when running curl of the same version (7.68).

I wanted to see the flow in strace for both approaches. Please find the strace links from the run and exec cases:

strace from running curl in the ephemeral pod: https://pastebin.com/NthHQacW

strace from running curl in the pod deployed through the Dockerfile: https://pastebin.com/6LCE5NXu

After comparing the files opened in the two logs, using

cat strace-log | grep open

I found that the strace log from approach 2 is missing the lines below.


2844  openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 7
2844  openat(AT_FDCWD, "/etc/host.conf", O_RDONLY|O_CLOEXEC <unfinished...>
2844  <... openat resumed>)             = 7
2844  openat(AT_FDCWD, "/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 7
2844  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 7
2844  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 7
2844  openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 7
2844  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC <unfinished ...>
2844  <... openat resumed>)             = 7
2844  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = 7

So curl inside the pod deployed through the Dockerfile is looking at neither /etc/resolv.conf nor /etc/nsswitch.conf.

I am puzzled why curl behaves differently in two pods with the same image and the same curl version in the same cluster.

  • "so curl is not checking for /etc/resolv.conf." It does but not directly, it uses the libc for that. `ltrace` might be better in such cases to see things. Also you are not showing exactly what you do with `dig`, the command and the reply. – Patrick Mevzek Aug 31 '20 at 21:54
  • Also see https://unix.stackexchange.com/questions/457166/can-not-resolve-local-domains-internal-to-my-office-lan and the answer explaining to stay off `.local` to avoid problems. – Patrick Mevzek Aug 31 '20 at 21:59
  • Maybe check your nsswitch.conf – Déjà vu Sep 01 '20 at 02:02
  • @PatrickMevzek , I added the output of dig. Thanks for sharing the other link. – jkalwar Sep 01 '20 at 05:27
  • @e2-e4 , I added the contents of nsswitch.conf file , it looked ok to me. Please check once. – jkalwar Sep 01 '20 at 05:28
  • @PatrickMevzek , I have tried to _curl backend-service_ from the frontend service pod , the strace logs clearly show calls going to /etc/resolv/conf , but these calls are missing while doing a _curl frontend-service_ from the backend pod. So I was of the impression that curl is not reaching the correct name server to resolve the internal domain name. – jkalwar Sep 01 '20 at 05:33

1 Answer


After trying a lot of options, I debugged the deployment configuration file I was using to deploy the pod to the AKS cluster. It had a hostPath-based volume mounted at the path "/var/run".

Once I removed the host mount, curl and wget worked as expected.

After discussing this behaviour with MS support, they confirmed that curl and wget do not fall back to /etc/resolv.conf for DNS resolution if a host mount points at "/var/run", possibly because of the way name resolution is implemented in the libraries curl and wget use. A plausible explanation is that the host's /var/run exposes an nscd socket (/var/run/nscd/socket); when glibc finds that socket, it asks the host's name service cache daemon to resolve names instead of consulting /etc/nsswitch.conf and /etc/resolv.conf, which would also explain why external names resolved while cluster-internal ones did not.
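
For illustration, the problematic part of the pod template looked roughly like this; the volume and container names are placeholders, and the key point is the hostPath volume mounted at /var/run:

    spec:
      containers:
        - name: backend
          image: <acr-name>.azurecr.io/backend:test   # placeholder image reference
          volumeMounts:
            - name: host-var-run
              mountPath: /var/run      # removing this mount restored DNS resolution for curl/wget
      volumes:
        - name: host-var-run
          hostPath:
            path: /var/run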
