Hi community and K8s experts,
I installed a clean K8s cluster on virtual machines (Debian 10). After the installation and the integration into my environment, my first step was to repair the CoreDNS resolution. I then ran further tests and found the behaviour described below. The test setup consists of an nslookup of google.com and a lookup of a local pod via its k8s DNS name.
Basic setup:
- K8s version: 1.19.0
- K8s setup: 1 master + 2 worker nodes
- Based on: Debian 10 VMs
- CNI: Flannel
Status of CoreDNS Pods
kube-system coredns-xxxx 1/1 Running 1 26h
kube-system coredns-yyyy 1/1 Running 1 26h
CoreDNS Log:
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
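For reference, the log of both CoreDNS replicas can be pulled with something along these lines (label selector as in a default kubeadm deployment):
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50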
CoreDNS config:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: ""
  name: coredns
  namespace: kube-system
  resourceVersion: "219"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: xxx
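The ConfigMap above was dumped with roughly the following command; the same object can also be edited in place, and the reload plugin picks up changes after a short delay:
kubectl -n kube-system get configmap coredns -o yaml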
CoreDNS Service
kubectl -n kube-system get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 15d k8s-app=kube-dns
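One thing worth checking right here: the endpoints behind kube-dns should list the IPs of both CoreDNS pods, otherwise queries against 10.96.0.10 only sometimes reach a working backend:
kubectl -n kube-system get endpoints kube-dns -o wide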
Kubelet config yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
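On a kubeadm-based install this configuration usually lives on every node at /var/lib/kubelet/config.yaml, so it can be compared across the nodes with:
cat /var/lib/kubelet/config.yaml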
Output of a pod's resolv.conf
/ # cat /etc/resolv.conf
nameserver 10.96.0.10
search development.svc.cluster.local svc.cluster.local cluster.local invalid
options ndots:5
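A side note on the slow answers: with options ndots:5, a name like google.com has fewer than five dots, so the resolver first tries it with every search domain (development.svc.cluster.local, svc.cluster.local, cluster.local, invalid) before trying it as-is. If any of those extra queries hang, the whole lookup feels very slow. Querying the fully qualified name with a trailing dot skips the search list and helps to separate a broken DNS path from hanging search-suffix queries, for example:
kubectl exec -i -t busybox -n development -- nslookup google.com.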
Output of the host's resolv.conf
cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 213.136.95.11
nameserver 213.136.95.10
search invalid
Output of the host's /run/flannel/subnet.env
cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
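Since the two CoreDNS pods can run on different nodes, the overlay network itself is also a suspect. A rough check is to take the CoreDNS pod IPs and ping them from the other nodes (10.244.1.23 below is just a placeholder for an actual pod IP):
kubectl -n kube-system get pods -o wide | grep coredns
ping -c 3 10.244.1.23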
Test setup
kubectl exec -i -t busybox -n development -- nslookup google.com
kubectl exec -i -t busybox -n development -- nslookup development.default
Busybox v1.28 image
- google.com nslookup works, but the answer takes very long
- local pod DNS address fails, and the answer also takes very long
Test setup
kubectl exec -i -t dnsutils -- nslookup google.com
kubectl exec -i -t busybox -n development -- nslookup development.default
K8s dnsutils test image
- google.com nslookup works only sporadically; it feels as if the answer sometimes comes from a cache and sometimes the query fails
- local pod DNS address works only sporadically, with the same cache-like behaviour
Test setup
kubectl exec -i -t dnsutilsalpine -n development -- nslookup google.com
kubectl exec -i -t dnsutilsalpine -n development -- nslookup development.default
Alpine image v3.12
- google.com nslookup works only sporadically, again with the same cache-like behaviour
- local pod DNS address fails (a sketch for querying the CoreDNS pods directly follows below)
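Given the sporadic behaviour in the last two tests, one idea is to take the kube-dns service VIP out of the picture and query each CoreDNS pod IP directly; if only the replica on one node answers, that points at node-to-node traffic rather than at CoreDNS itself. A sketch, with 10.244.0.2 and 10.244.1.2 standing in for the real pod IPs:
kubectl exec -i -t dnsutils -- nslookup kubernetes.default.svc.cluster.local 10.244.0.2
kubectl exec -i -t dnsutils -- nslookup kubernetes.default.svc.cluster.local 10.244.1.2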
The logs are empty. Do you have an idea where the problem is?
IP routes on the master node
default via X.X.X.X dev eth0 onlink
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
X.X.X.X via X.X.X.X dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
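The routes look as expected on the master; the same flannel.1 routes should also exist on both workers, and the VXLAN interface should be up. That can be checked on each node with:
ip route | grep 10.244
ip -d link show flannel.1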
UPDATE
I reinstalled the cluster and now use Calico as the CNI, but I have the same problem.
UPDATE 2
After a more detailed error analysis under Calico, I found that the corresponding Calico pods were not working properly. Digging further, I discovered that I had not opened port 179 in the firewall. After fixing this, the pods came up correctly, and I can confirm that name resolution now works as well.
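In case someone runs into the same thing: port 179/TCP is the BGP port Calico uses between nodes. With a plain iptables firewall, a rule along these lines allows it (10.0.0.0/24 is only a placeholder for the node network; adjust accordingly if the rules are managed by ufw, nftables or an external firewall):
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 179 -j ACCEPT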