We have a medium-sized Kubernetes cluster in which approximately 70 pods connect to a single socket server. It works fine most of the time, but occasionally one or two pods fail to resolve the cluster DNS, and the lookup times out with the following error:
Error: dial tcp: lookup thishost.production.svc.cluster.local on 10.32.0.10:53: read udp 100.65.63.202:36638->100.64.209.61:53: i/o timeout at
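For reference, here is a minimal Go sketch of how we try to reproduce the failure from inside an affected pod (the error format suggests the application uses Go's net resolver). The DNS service IP 10.32.0.10 and the hostname are taken from the error above; the retry count and timeout are arbitrary:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Custom resolver that always dials the cluster DNS service directly,
	// bypassing /etc/resolv.conf search paths, so we can tell whether the
	// timeout comes from the DNS service itself or from the pod's resolver
	// configuration.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// "udp" mirrors what the failing lookup used; switch to "tcp"
			// to check whether only UDP queries are affected.
			return d.DialContext(ctx, "udp", "10.32.0.10:53")
		},
	}

	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		addrs, err := r.LookupHost(ctx, "thishost.production.svc.cluster.local")
		cancel()
		if err != nil {
			fmt.Printf("attempt %d: %v\n", i, err)
			continue
		}
		fmt.Printf("attempt %d: %v\n", i, addrs)
	}
}
```

PreferGo forces the pure-Go resolver, so the test does not depend on the container's libc resolver behaving the same way as the failing application.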
This is not the only service that fails intermittently; other services hit the same error from time to time. We used to ignore it, since it was very random and rare, but in the case above it is very noticeable. The only fix we have found is to kill the faulty pod (restarting it doesn't help).
Has anyone experienced this? Do you have any tips on how to debug or fix it?
It almost feels as if this is beyond our expertise and comes down to the internals of the DNS resolver.
Kubernetes version: 1.23.4
Container network: Cilium