Intermittent failure of K8S DNS resolver / dial udp / Operation cancelled

Question

We're having a medium sized Kubernetes cluster. So imagine a situation where approximately 70 pods are being connecting to a one socket server. It works fine most of the time, however, from time to time one or two pods just fail to resolve k8s DNS, and it times out with the following error:

Error: dial tcp: lookup thishost.production.svc.cluster.local on 10.32.0.10:53: read udp 100.65.63.202:36638->100.64.209.61:53: i/o timeout at

What we noticed is that this is not the only service that's failing intermittently. Other services experience that from time to time. We used to ignore it, since it was very random and rate, however in the above case that is very noticeable. The only solution is to actually kill the faulty pod. (Restarting doesn't help)

Has anyone experienced this? Do you have any tips on how to debug it/ fix?

It almost feels as if it's beyond our expertise and is fully related to the internals of the DNS resolver.

Kubernetes version: 1.23.4 Container Network: cilium

score 1 · Answer 1 · answered May 24 '22 at 09:44

1

this issue most probably will be related to the CNI. I would suggest following the link to debug the issue: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

and to be able to help you we need more information:

is this cluster on-premise or cloud?
what are you using for CNI?
how many nodes are running and are they all in the same subnet? if yes, dose they have other interfaces?
share the below command result.

kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o wide
when you restart the pod to solve the issue temp does it stay on the same node or does it change?

answered May 24 '22 at 09:44

mohalahmad

76
5

1. Cloud (Scaleway) – user3677173 May 24 '22 at 09:57
2. I guess we don't use any plugins – user3677173 May 24 '22 at 09:58
3. It depends. Usually around 60-80 nodes running on the same network – user3677173 May 24 '22 at 09:59
5. Restarting the pod doesn't solve the issue. Re-creating it solves it – user3677173 May 24 '22 at 09:59
4. NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES coredns-85b59c748-scffz 1/1 Running 0 26h 100.64.209.61 scw-postgres-d5f4af91bb434bd39d6e40ff – user3677173 May 24 '22 at 10:00
so this `coredns` deployment is running on a single node. Should it be replicated? – user3677173 May 24 '22 at 10:02
I suggest to follow the debugging in this link https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ – mohalahmad May 24 '22 at 10:22
yes, usually coredns have 2 replicas. I believe the issue is when the coredns are working in one node and the service on the other node and these nodes face issue communicating with each other, you can check by ssh into a node and try telnet on 100.64.209.61 port 53. the IP from the command result you shared. – mohalahmad May 24 '22 at 10:26
Thank you for you tips! I will go through it! – user3677173 May 24 '22 at 10:55

Intermittent failure of K8S DNS resolver / dial udp / Operation cancelled

1 Answers1