
I have a Kubernetes 1.16.11 cluster set up on Ubuntu 16.04, with Weave as the CNI network. I have a web application that communicates with services in the same cluster. The web application shows results the first time, but after that it doesn't, and requests to the services stay in a pending state. Applications running in the pods respond very slowly or time out. I suspect this is caused by CoreDNS: I can see the coredns pods crashing frequently. Before I hit the web application, the READY state of the coredns pods is 1/1, but after a few seconds it goes to 0/1 and the status becomes CrashLoopBackOff.

[Screenshot: coredns pod status]

Following are the events in one of the coredns pods:

[Screenshot: coredns pod events]

The following activity (pod logs) happens in CoreDNS before it stops itself:

.:53
2020-08-19T05:41:42.572Z [INFO] plugin/reload: Running configuration MD5 = ed017d072d4dd28f5c79c00674bf5857
2020-08-19T05:41:42.572Z [INFO] CoreDNS-1.6.2
2020-08-19T05:41:42.572Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
2020-08-19T05:41:42.600Z [INFO] 127.0.0.1:46832 - 49906 "HINFO IN 7243822456926758847.6833987984127310077. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.028488974s
2020-08-19T05:53:01.623Z [INFO] 10.47.0.0:30093 - 50722 "A IN checkpoint-api.weave.works.kube-system.svc.cluster.local. udp 74 false 512" NXDOMAIN qr,aa,rd 167 0.000213433s
2020-08-19T05:53:01.624Z [INFO] 10.47.0.0:57021 - 58779 "AAAA IN checkpoint-api.weave.works.kube-system.svc.cluster.local. udp 74 false 512" NXDOMAIN qr,aa,rd 167 0.000202274s
2020-08-19T05:53:01.626Z [INFO] 10.47.0.0:31645 - 42289 "A IN checkpoint-api.weave.works.svc.cluster.local. udp 62 false 512" NXDOMAIN qr,aa,rd 155 0.000118395s
2020-08-19T05:53:01.629Z [INFO] 10.47.0.0:61413 - 46930 "AAAA IN checkpoint-api.weave.works.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000121312s
2020-08-19T05:53:01.632Z [INFO] 10.47.0.0:19447 - 15534 "AAAA IN checkpoint-api.weave.works.ec2.internal. udp 57 false 512" NXDOMAIN qr,rd,ra 57 0.001751403s
2020-08-19T05:53:01.632Z [INFO] 10.47.0.0:9886 - 17303 "A IN checkpoint-api.weave.works.ec2.internal. udp 57 false 512" NXDOMAIN qr,rd,ra 57 0.001958423s
2020-08-19T05:57:15.091Z [INFO] 10.32.0.1:53297 - 34003 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.000157001s
2020-08-19T05:57:15.097Z [INFO] 10.32.0.1:30306 - 62967 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.00008247s
2020-08-19T05:57:17.592Z [INFO] 10.32.0.1:53297 - 34003 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.000118957s
2020-08-19T05:57:17.594Z [INFO] 10.32.0.1:28474 - 9972 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.000095925s
2020-08-19T05:57:17.595Z [INFO] 10.32.0.1:16322 - 49891 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.000141925s
2020-08-19T05:57:17.599Z [INFO] 10.32.0.1:30306 - 62967 "A IN api-service.default.svc.cluster.local. udp 59 false 512" NOERROR qr,aa,rd 116 0.000080882s
[INFO] SIGTERM: Shutting down servers then terminating

and these are the logs in the pod after it crashes and tries to start again:

2020-08-18T19:16:21.806Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2020-08-18T19:16:21.806Z [INFO] CoreDNS-1.6.2
2020-08-18T19:16:21.806Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
2020-08-18T19:16:27.807Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:60178->10.99.0.2:53: i/o timeout
2020-08-18T19:16:30.808Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:53454->10.99.0.2:53: i/o timeout
2020-08-18T19:16:31.808Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:57946->10.99.0.2:53: i/o timeout
2020-08-18T19:16:32.808Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:56865->10.99.0.2:53: i/o timeout
2020-08-18T19:16:35.808Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:59673->10.99.0.2:53: i/o timeout
2020-08-18T19:16:38.809Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:34594->10.99.0.2:53: i/o timeout
2020-08-18T19:16:41.809Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:55874->10.99.0.2:53: i/o timeout
2020-08-18T19:16:44.810Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:52003->10.99.0.2:53: i/o timeout
2020-08-18T19:16:47.810Z [ERROR] plugin/errors: 2 1971409473292337290.6469642637242929399. HINFO: read udp 10.32.0.3:41473->10.99.0.2:53: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
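
The HINFO read timeouts above are CoreDNS failing to reach the upstream resolver it forwards to (`10.99.0.2:53`). For reference, here is a sketch of the stock Corefile that kubeadm installs for CoreDNS 1.6 — I'm reproducing it from memory as an assumption; the live one can be checked with `kubectl -n kube-system get configmap coredns -o yaml`:

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf   # upstream comes from the node's resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

If the node's `/etc/resolv.conf` points back at the cluster DNS address, the `forward` plugin sends queries back into CoreDNS itself, which would explain the timeouts; the SIGTERM itself suggests the kubelet killed the pod after failing liveness/readiness probes. (This is my reading of the logs, not a confirmed diagnosis.)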

Can you please let me know whether this is the reason for the slowness/unresponsiveness, or whether there are other causes, and what the fix for this is?

Thanks in advance.

Prasad
  • Where is `10.32.0.3`? Kubernetes or outside the cluster? – Rico Aug 18 '20 at 20:03
  • 10.32.0.3 is inside kubernetes – Prasad Aug 18 '20 at 20:05
  • Sorry I meant `10.99.0.2` – Rico Aug 18 '20 at 20:08
  • 10.99.0.2 is also inside Kubernetes; it is the IP address of another coredns pod – Prasad Aug 18 '20 at 20:16
  • So you have 2 coredns deployments? The output above shows `10.32.0.3` and `10.32.0.2` as coredns pods – Rico Aug 18 '20 at 20:40
  • Can you check if the two pods are running on different nodes? If so, can those two nodes talk to each other? Make sure it's not a networking issue. – Faheem Aug 18 '20 at 21:46
  • These two coredns pods are running on the same machine; they are deployed on the Kubernetes master – Prasad Aug 19 '20 at 05:52
  • The coredns restarts stopped, i.e. it doesn't crash now, after issuing this command: `kubectl -n kube-system get deployment coredns -o yaml | sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | kubectl apply -f -` – Prasad Aug 19 '20 at 09:26
  • but 3 out of 10 times we get a bad gateway error when there are multiple calls to the server – Prasad Aug 19 '20 at 09:28
  • Have you found the reason for that? – Nick Sep 07 '20 at 07:30
  • I made coredns run on a worker node, as my master has less CPU and RAM than the worker. After this change I didn't face crashes. – Prasad Sep 07 '20 at 18:53
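
For anyone landing here, a sketch of the fix described in the last comment — pinning the coredns deployment to a worker node via a `nodeSelector`. The node name `worker-1` is a placeholder; substitute your own worker's hostname:

```yaml
# Hypothetical patch fragment for the coredns deployment in kube-system.
# "worker-1" is a placeholder node name — replace it with the output of
# `kubectl get nodes` for your worker. Apply with e.g.:
#   kubectl -n kube-system patch deployment coredns --patch "$(cat patch.yaml)"
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1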

0 Answers