I've been having a daily issue with my Kubernetes cluster (running 1.18) where one of the nodes goes over 100% CPU utilisation and Kubernetes stops connecting external visitors to my pods (a website outage, basically).
The strange thing is that the pods are always sitting at a comfortable 30% (or lower!) CPU, so the application itself seems fine.
When I describe the node in question, I see mention of a node-problem-detector timeout in the events.
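For reference, this is the command I'm running (node name as it appears in my VMSS node pool):

```
kubectl describe node nodepoo1-vmss000007
```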
```
Events:
  Type     Reason                  Age                       From                                     Message
  ----     ------                  ----                      ----                                     -------
  Normal   NodeNotSchedulable      10m                       kubelet                                  Node nodepoo1-vmss000007 status is now: NodeNotSchedulable
  Warning  KubeletIsDown           9m44s (x63 over 5h21m)    kubelet-custom-plugin-monitor            Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s"
  Warning  ContainerRuntimeIsDown  9m41s (x238 over 5h25m)   container-runtime-custom-plugin-monitor  Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_runtime.s"
```
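One diagnostic step I can take during a spike (a sketch, assuming I can get a shell on the node and that the script path copied verbatim from the event above is executable as-is):

```
# On the affected node: run the node-problem-detector plugin script by
# hand and time it, to see whether it genuinely hangs under the CPU spike.
time sudo /etc/node-problem-detector.d/plugin/check_kubelet.s
echo "exit status: $?"
```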
My current workaround has been to run three nodes in my node pool and effectively babysit Kubernetes: during the monitoring outage I cordon the troublesome node and move all the pods onto one of the other nodes (roughly the commands sketched below). Once things are back to normal, usually after about 15 minutes, I uncordon the affected node and the cycle starts again.
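The routine looks roughly like this (node name taken from the events above; `--delete-local-data` is the drain flag as it exists on 1.18-era kubectl, later renamed `--delete-emptydir-data`):

```
# Stop new pods landing on the misbehaving node, then evict its
# pods onto the other two nodes in the pool.
kubectl cordon nodepoo1-vmss000007
kubectl drain nodepoo1-vmss000007 --ignore-daemonsets --delete-local-data

# ~15 minutes later, once CPU settles, put the node back into rotation.
kubectl uncordon nodepoo1-vmss000007
```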
I was particularly unlucky this weekend, when I had three CPU spikes within 24 hours.
How can I go about fixing this? I can't seem to find any information on the Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s" error I'm seeing.
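For completeness, the next thing I plan to check is the node-problem-detector config that sets the plugin timeout; I'm assuming the monitor config lives somewhere under the same directory named in the event:

```
# On the node: see what node-problem-detector ships alongside the plugin
# scripts, and which "timeout" value applies to the custom plugin monitors.
ls -lR /etc/node-problem-detector.d/
grep -R "timeout" /etc/node-problem-detector.d/
```

Any pointers on what typically causes this timeout, or how to stop the node falling over, would be appreciated.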