
I've been having a daily issue with my Kubernetes cluster (running 1.18) where one of the nodes goes over 100% CPU utilisation and Kubernetes fails to route external visitors to my pods (a website outage, basically).

The strange thing is the pods are always sitting at a comfortable 30% (or lower!) CPU. So the application itself seems okay.

When I describe the node in question, I see mention of a node-problem-detector timeout.

Events:
  Type     Reason                  Age                      From                                     Message
  ----     ------                  ---                      ----                                     -------
  Normal   NodeNotSchedulable      10m                      kubelet                                  Node nodepoo1-vmss000007 status is now: NodeNotSchedulable
  Warning  KubeletIsDown           9m44s (x63 over 5h21m)   kubelet-custom-plugin-monitor            Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s"
  Warning  ContainerRuntimeIsDown  9m41s (x238 over 5h25m)  container-runtime-custom-plugin-monitor  Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_runtime.s"
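
For reference, the event listing above comes from `kubectl describe node`; depending on how the monitors are configured, node-problem-detector also surfaces its findings as conditions on the node object, which can be checked directly:

```sh
# Describe the node named in the events above
kubectl describe node nodepoo1-vmss000007

# List the node conditions (node-problem-detector updates these as well as emitting events)
kubectl get node nodepoo1-vmss000007 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```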

My current approach has been to run three nodes in my node pool and effectively babysit Kubernetes by cordoning the troublesome node and moving all of its pods onto one of the other nodes during the monitoring outage. Roughly 15 minutes later, once things are back to normal, I uncordon the affected node and the cycle starts again.
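
For reference, that cordon/drain/uncordon cycle looks roughly like this (a sketch; the node name is the one from the events above, and the drain flags may need adjusting for your workloads):

```sh
# Keep new pods off the troublesome node
kubectl cordon nodepoo1-vmss000007

# Evict the pods currently running on it (add --delete-local-data or
# --delete-emptydir-data, depending on kubectl version, if pods use emptyDir volumes)
kubectl drain nodepoo1-vmss000007 --ignore-daemonsets

# ...wait ~15 minutes for the node to recover, then let it take pods again
kubectl uncordon nodepoo1-vmss000007
```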

I was particularly unlucky this weekend where I had three CPU spikes within 24 hours.

[CPU chart showing a spike from one of the three nodes]

How can I go about fixing this issue? I can't find any information on the `Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s"` error I'm seeing.

Lukas Eichler
alex

  • Did you try to ssh into the problematic node and check the CPU usage by processes? – bluelurker Feb 08 '22 at 10:25
  • `After 15 minutes once things are back to normal, I will uncordon the affected node...` - do you mean the node gets back to normal **AFTER** you shifted all of its running pods to other nodes? – gohm'c Feb 08 '22 at 10:33
  • @gohm'c Yep, that's right. After moving the pods onto different nodes, the CPU of the affected node starts dropping. – alex Feb 08 '22 at 10:38
  • @bluelurker Interesting, did not know that was an option. I'm running on Azure, so looks like I can follow these instructions: https://learn.microsoft.com/en-us/azure/aks/ssh#create-the-ssh-connection-to-a-linux-node – alex Feb 08 '22 at 10:39
  • Try this [identify high cpu consuming containers aks](https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/identify-high-cpu-consuming-containers-aks) – gohm'c Feb 08 '22 at 10:44
  • @gohm'c Nope, that doesn't show anything out of the ordinary. Max of 50% for one container, and that's it. – alex Feb 08 '22 at 15:04
  • @bluelurker My `kubectl` doesn't seem to have a `debug` command. What version of kubectl was this introduced in? – alex Feb 08 '22 at 23:10
  • Are you running Linux or Windows nodes? I would suggest opening an Azure support ticket, as this is likely an AKS issue. – Lukas Eichler Feb 16 '22 at 12:13
  • I'm having this exact problem. I've spent a lot of time debugging, also together with Microsoft, and so far we have not been able to identify the cause. – lindhe Apr 25 '22 at 13:38
  • I'm having the same problem as well. The cluster ran smoothly for about a year, and now twice in two weeks one node became completely unresponsive, to the point where even connecting to the node via ssh was not possible. Only restarting the VM helped. Did any of you guys find a solution? – dan Jan 02 '23 at 07:50
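
A sketch of the node-shell approach discussed in the comments above: recent kubectl versions can start an interactive debug pod on the node with the host filesystem mounted (the `debug` subcommand is not available in older clients such as 1.18's kubectl, which would explain the missing command); the image name is just an example.

```sh
# Start an interactive debug pod on the affected node
kubectl debug node/nodepoo1-vmss000007 -it --image=ubuntu

# Inside the debug pod, the node's root filesystem is mounted at /host
chroot /host
top   # check which processes are actually burning the CPU
```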

2 Answers


You could try opening an SSH connection to the node and then checking which process(es) consume CPU using `top`. If the process runs in a pod and you have `crictl` installed on the node, you can use https://github.com/k8s-school/pid2pod to find the pod that is running the process.
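
For example, a rough sketch of that lookup done by hand (the PID and container ID below are illustrative; pid2pod automates essentially the same steps):

```sh
# On the node: list the processes using the most CPU
# (the scriptable equivalent of what top shows)
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 20

# Map a suspicious PID back to a container: the process's cgroup path
# normally embeds the pod UID and container ID ("12345" is illustrative)
PID=12345
cat /proc/$PID/cgroup

# Take the container ID from that path and ask the runtime which pod owns it
# (the container ID below is illustrative)
CONTAINER_ID=abcdef123456
crictl inspect "$CONTAINER_ID" | grep -E '"io.kubernetes.pod.(name|namespace)"'
```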

Fabrice Jammes

Try looking into your `periodSeconds` and `timeoutSeconds` specifications. The answer is most likely hidden in those settings.
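
Assuming this refers to the probe timings on the workloads, something like the following dumps the relevant fields so they can be sanity-checked (the deployment name `my-app` is a placeholder):

```sh
# Print each container's liveness probe periodSeconds / timeoutSeconds
# for a hypothetical deployment named "my-app"
kubectl get deployment my-app -o \
  jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.livenessProbe.periodSeconds}{"\t"}{.livenessProbe.timeoutSeconds}{"\n"}{end}'
```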

ViKi Vyas