
Hi community and K8s experts,

I installed a clean K8s cluster on virtual machines (Debian 10). After the installation and the integration into my landscape, my first step was to fix the CoreDNS resolution. I then ran further tests and found the following. The test setup consisted of an nslookup of google.com and a lookup of a local pod via its k8s DNS address.

Basic setup:

  • K8s version: 1.19.0
  • K8s setup: 1 master + 2 worker nodes
  • Based on: Debian 10 VMs
  • CNI: Flannel

Status of CoreDNS Pods

kube-system            coredns-xxxx 1/1     Running   1          26h
kube-system            coredns-yyyy 1/1     Running   1          26h
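For reference, the same pods can be listed by label rather than by their generated names (assuming the default k8s-app=kube-dns label that kubeadm applies); -o wide also shows the pod IPs, which are useful for the direct queries further down:

# list the CoreDNS pods by label; -o wide adds pod IP and node
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide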

CoreDNS Log:

.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
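The full logs can be pulled the same way (again assuming the default label); with two replicas this fetches both:

# tail the logs of all CoreDNS pods
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50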

CoreDNS config:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: ""
  name: coredns
  namespace: kube-system
  resourceVersion: "219"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: xxx
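Nothing unusual here: the kubernetes plugin serves cluster.local, and forward . /etc/resolv.conf hands everything else to whatever upstreams the CoreDNS pod sees in its own /etc/resolv.conf (normally the node's). A quick way to test the two paths separately is to query the Service IP directly from a test pod (a sketch; 10.96.0.10 is the kube-dns ClusterIP from the Service output below, and the dnsutils image is assumed to ship dig):

# cluster-internal name: answered by the kubernetes plugin, never forwarded
kubectl exec -i -t dnsutils -- dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
# external name: forwarded to the node's upstream resolvers
kubectl exec -i -t dnsutils -- dig @10.96.0.10 google.com +short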

CoreDNS Service

kubectl -n kube-system get svc -o wide
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   15d   k8s-app=kube-dns
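The Service itself looks fine; it is also worth checking that it actually has the CoreDNS pods behind it, since a Service with no endpoints would produce exactly this kind of timeout:

# should list both CoreDNS pod IPs on ports 53/UDP, 53/TCP and 9153
kubectl -n kube-system get endpoints kube-dns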

Kubelet config yaml

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
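clusterDNS here (10.96.0.10) matches the kube-dns ClusterIP above, which is the mismatch mdaniel suspected in the comments. A quick way to re-check it on every node, using the path mdaniel mentioned (the default kubeadm location):

# should print 10.96.0.10, i.e. the kube-dns Service IP
grep -A1 clusterDNS /var/lib/kubelet/config.yaml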

Output of the pod's resolv.conf

/ # cat /etc/resolv.conf 
nameserver 10.96.0.10
search development.svc.cluster.local svc.cluster.local cluster.local invalid
options ndots:5
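Note the effect of ndots:5: any name with fewer than five dots (google.com included) is first expanded through every search domain before being tried as-is, so one slow path to CoreDNS turns into several slow queries. Appending a trailing dot makes the name fully qualified and skips the search list, which helps separate the two cases:

/ # nslookup google.com.                              # trailing dot = FQDN, no search expansion
/ # nslookup kubernetes.default.svc.cluster.local.    # cluster-internal FQDN that always exists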

Output of the host's resolv.conf

cat /etc/resolv.conf 
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 213.136.95.11
nameserver 213.136.95.10
search invalid
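Since CoreDNS forwards to these two upstreams, it is worth ruling them out by querying them directly from the node (a sketch, assuming dig from the dnsutils package is installed on the host):

# both should answer well under a second
dig @213.136.95.11 google.com +short
dig @213.136.95.10 google.com +short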

Output of the host's /run/flannel/subnet.env

cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
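FLANNEL_MTU=1450 points at the VXLAN backend, which tunnels all pod-to-pod traffic between nodes over UDP port 8472. If a firewall silently drops that port, the result is exactly the kind of intermittent cross-node DNS failure shown in the tests below (the same class of problem as the Calico BGP port found in UPDATE 2). One way to watch for the tunnel traffic (interface name eth0 assumed):

# run on one node while a pod on another node performs a lookup
tcpdump -ni eth0 udp port 8472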

Test setup

kubectl exec -i -t busybox -n development -- nslookup google.com
kubectl exec -i -t busybox -n development -- nslookup development.default

Busybox v1.28 image

  • The google.com nslookup works, but the answer takes very long
  • The local pod DNS lookup fails, and the answer takes very long

Test setup

kubectl exec -i -t dnsutils -- nslookup google.com
kubectl exec -i -t dnsutils -- nslookup development.default

K8s dnsutils test image

  • The google.com nslookup works only sporadically; it feels like the address is sometimes served from a cache and sometimes the lookup fails.
  • The local pod DNS lookup works only sporadically; it feels like the address is sometimes served from a cache and sometimes the lookup fails.

Test setup

kubectl exec -i -t dnsutilsalpine -n development -- nslookup google.com
kubectl exec -i -t dnsutilsalpine -n development -- nslookup development.default

Alpine image v3.12

  • The google.com nslookup works only sporadically; it feels like the address is sometimes served from a cache and sometimes the lookup fails.
  • The local pod DNS lookup fails.

The logs are empty. Do you have an idea where the problem is?

IP Routes master node

default via X.X.X.X dev eth0 onlink 
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink 
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink 
X.X.X.X via X.X.X.X dev eth0 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
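One more test that narrows this down: query each CoreDNS pod IP directly, bypassing the kube-dns Service. If per-pod queries succeed but the Service IP is flaky, kube-proxy/conntrack is the suspect; if only the pod on the other node fails, the overlay network is (which is what the firewall problem in UPDATE 2 turned out to be). The pod IP below is a placeholder; take the real ones from kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide:

# 10.244.0.2 is a placeholder for an actual CoreDNS pod IP
kubectl exec -i -t dnsutils -- dig @10.244.0.2 kubernetes.default.svc.cluster.local +short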

UPDATE

I reinstalled the cluster, this time with Calico as the CNI, and I have the same problem.

UPDATE 2

After a detailed error analysis under Calico, I found that the corresponding pods were not working properly. Digging deeper, I discovered that I had not opened the required port 179 in the firewall. After fixing this, the pods worked properly, and I confirmed that name resolution now works as well.
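For anyone hitting the same symptom: TCP port 179 is Calico's BGP peering port, and a blocked peer is visible directly in the node status (assuming calicoctl is installed on the node):

# healthy peers show "Established"; a blocked port 179 leaves them in "Connect"/"Active"
sudo calicoctl node status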

ZPascal
  • You need to edit your question and include the debugging steps you have already tried; I would _guess_ [the `clusterDNS:`](https://pkg.go.dev/k8s.io/kubernetes@v1.19.1/pkg/kubelet/apis/config?tab=doc#KubeletConfiguration) is pointed to the wrong value – mdaniel Sep 13 '20 at 17:50
  • @mdaniel I updated the config. – ZPascal Sep 13 '20 at 18:48
  • You posted the value of the coredns pods, but not the `Service`; `kubectl -n kube-system get svc -o wide` and also did you check the `clusterDNS:` value in `/var/lib/kubelet/config.yaml` to ensure it matches that coredns `Service`? – mdaniel Sep 14 '20 at 03:49
  • @mdaniel The Service and the config yaml look normal. I have added them to the main post. – ZPascal Sep 14 '20 at 05:58
  • have you been checking workarounds from here: https://github.com/coreos/flannel/issues/1245 ? – Nick Sep 14 '20 at 11:54
  • @Nick No, I haven't tried it yet. Can someone compare my IP (main post) routes with their own? – ZPascal Sep 15 '20 at 07:41
  • I just need to install a cluster with flannel. Will give it a try. – Nick Sep 15 '20 at 07:51
  • How exactly did you install the k8s cluster? Is there any doc? :) – Nick Sep 16 '20 at 22:05
  • @Nick I have used this [tutorial](https://www.digitalocean.com/community/tutorials/how-to-create-a-kubernetes-cluster-using-kubeadm-on-ubuntu-18-04) including the current Kubernetes and flannel version. – ZPascal Sep 17 '20 at 05:59
  • Did you install https://raw.githubusercontent.com/coreos/flannel/a70459be0084506e4ec919aa1c114638878db11b/Documentation/kube-flannel.yml as per the tutorial, or download the newest version from github.com/coreos/flannel/Documentation/kube-flannel.yml? I'm reproducing your setup (on Debian 10 instances on GCP). – Nick Sep 17 '20 at 13:44
  • @Nick I downloaded the newest version. Many thanks for your effort. – ZPascal Sep 19 '20 at 22:18

2 Answers


Unable to post that much via comments. Posting as an answer.

I checked the guide you've been referring to and set up my own test cluster (GCP, 3 x Debian 10 VMs).

The difference is that in my ~/kube-cluster/master.yml I've set a different link to kube-flannel.yml (and the content of that file differs from the file in the guide :))

$ grep http master.yml 
      shell: kubectl apply -f  https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml >> pod_network_setup.txt

On my cluster:

$ kubectl get nodes
NAME         STATUS   ROLES    AGE     VERSION
instance-1   Ready    master   2m48s   v1.19.0
instance-2   Ready    <none>   38s     v1.19.0
instance-3   Ready    <none>   38s     v1.19.0

kubectl get pods -o wide -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
coredns-f9fd979d6-8sxg7              1/1     Running   0          4m48s   10.244.0.2    instance-1   <none>           <none>
coredns-f9fd979d6-z5gdl              1/1     Running   0          4m48s   10.244.0.3    instance-1   <none>           <none>

kube-flannel-ds-4khll                1/1     Running   0          2m58s   10.156.0.21   instance-3   <none>           <none>
kube-flannel-ds-h8d9l                1/1     Running   0          2m58s   10.156.0.20   instance-2   <none>           <none>
kube-flannel-ds-zhzbf                1/1     Running   0          4m49s   10.156.0.19   instance-1   <none>           <none>

$ kubectl -n kube-system get svc -o wide
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   6m15s   k8s-app=kube-dns

sammy@instance-1:~$ ip route
default via 10.156.0.1 dev ens4 
10.156.0.1 dev ens4 scope link 
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink 
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown


I see no DNS lag issues.

kubectl create deployment busybox --image=nkolchenko/enea:server_go_latest
deployment.apps/busybox created

sammy@instance-1:~$ time kubectl exec -it busybox-6f744547bf-hkxnk -- nslookup default.default
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find default.default: NXDOMAIN

** server can't find default.default: NXDOMAIN

command terminated with exit code 1

real    0m0.227s
user    0m0.106s
sys     0m0.012s


sammy@instance-1:~$ time kubectl exec -it busybox-6f744547bf-hkxnk -- nslookup google.com
Server:         10.96.0.10
Address:        10.96.0.10:53

Non-authoritative answer:
Name:   google.com
Address: 172.217.22.78

Non-authoritative answer:
Name:   google.com
Address: 2a00:1450:4001:820::200e


real    0m0.223s
user    0m0.102s
sys     0m0.012s

Let me know if you need me to run any other tests, I'll keep this cluster throughout the weekend and then tear it down.

UPDATE:

$ cat ololo 
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always

$ kubectl create -f ololo 
pod/dnsutils created


$ kubectl get -A all  -o wide | grep dns
default       pod/dnsutils                             1/1     Running   0          63s     10.244.2.8    instance-2   <none>           <none>
kube-system   pod/coredns-cc8845745-jtvlh              1/1     Running   0          10m     10.244.1.3    instance-3   <none>           <none>
kube-system   pod/coredns-cc8845745-xxh28              1/1     Running   0          10m     10.244.0.4    instance-1   <none>           <none>
kube-system   pod/coredns-cc8845745-zlv84              1/1     Running   0          10m     10.244.2.6    instance-2   <none>           <none>

instance-1:~$ kubectl exec -i -t dnsutils -- time nslookup google.com
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   google.com
Address: 172.217.21.206
Name:   google.com
Address: 2a00:1450:4001:818::200e

real    0m 0.01s
user    0m 0.00s
sys     0m 0.00s




Nick
  • Many thanks for the tests and the cluster installation. I can't see any difference to my IP routes at the moment. Could you please do another test with the dnsutils image, three coreDNS instances and post the routes of the worker nodes? – ZPascal Sep 19 '20 at 22:16
  • @ZPascal, please share the _exact_ dnsutils image location. Which exact image shall I use? --image=dnsutils? tutum/dnsutils? something else? gcr.io/kubernetes-e2e-test-images/dnsutils:1.3? – Nick Sep 22 '20 at 14:36
  • Updated my answer. No issues with the DNS. Alpine, dnsutils, busybox... all work. Tearing down the cluster. – Nick Sep 22 '20 at 14:44
  • I used the gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 image. – ZPascal Sep 27 '20 at 09:05
  • Thank you very much. I will now reinstall my cluster and try Calico as CNI. – ZPascal Sep 27 '20 at 09:10

After I installed Calico and set the appropriate firewall rules (opening port 179 on all nodes), the CoreDNS pods ran smoothly. With that, the different images could resolve DNS names and the forwarding worked correctly.
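For reference, one way to open the port on the Debian nodes (a sketch; adapt it to whatever actually manages the firewall there):

# allow Calico's BGP peering between the cluster nodes
iptables -I INPUT -p tcp --dport 179 -j ACCEPT
# or, if ufw is in use:
ufw allow 179/tcp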

ZPascal