
I have an existing cluster running k8s version 1.12.8 on AWS EC2. The cluster contains several pods - some serving web traffic and others configured as scheduled CronJobs. The cluster has been running fine in its current configuration for at least 6 months, with CronJobs running every 5 minutes.

Recently, the CronJobs simply stopped. Viewing the pods via kubectl shows that all the scheduled CronJobs' last run was at roughly the same time. Logs sent to AWS CloudWatch show no error output, and stop at the same time kubectl reports for the last run.
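For illustration, the stall shows up roughly like this (the CronJob name, namespace and ages below are placeholders, not the real values):

    kubectl get cronjobs --all-namespaces
    # NAMESPACE   NAME         SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
    # default     my-cronjob   */5 * * * *   False     0        6d              243d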

In trying to diagnose this issue I have found a broader pattern of the cluster being unresponsive to changes, e.g. I cannot retrieve logs or nodes via kubectl.

I've deleted Pods in ReplicaSets and they never come back. I've set autoscale values on ReplicaSets and nothing happens.
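As a rough sketch of the kinds of commands that now hang or silently do nothing (resource names are placeholders):

    kubectl get nodes                                     # hangs / fails
    kubectl logs my-web-pod                               # hangs / fails
    kubectl delete pod my-web-pod                         # pod is never recreated by its ReplicaSet
    kubectl autoscale rs my-replicaset --min=2 --max=5    # accepted, but nothing changes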

Investigation of the kubelet logs on the master instance revealed repeating errors, coinciding with the time the failure was first noticed:

I0805 03:17:54.597295 2730 kubelet.go:1928] SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z-west-y.compute.internal_kube-system(181xxyyzz)", event: &pleg.PodLifecycleEvent{ID:"181xxyyzz", Type:"ContainerDied", Data:"405ayyzzz"}
        ...
E0805 03:18:10.867737 2730 kubelet_node_status.go:378] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"Ready\"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z-west-y.compute.internal/status?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
E0805 03:18:20.869436 2730 kubelet_node_status.go:378] Error updating node status, will retry: error getting node "ip-172-20-60-88.eu-west-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Running docker ps on the master node shows that both the k8s_kube-controller-manager_kube-controller-manager and k8s_kube-scheduler_kube-scheduler containers were started 6 days ago, whereas the other k8s containers have been up for 8+ months.
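For reference, a rough sketch of that check (output trimmed; names and ages are approximations):

    sudo docker ps --format 'table {{.Names}}\t{{.Status}}' | grep kube
    # k8s_kube-scheduler_kube-scheduler-ip-x-x-x-x...                    Up 6 days
    # k8s_kube-controller-manager_kube-controller-manager-ip-x-x-x-x...  Up 6 days
    # k8s_kube-apiserver_kube-apiserver-ip-x-x-x-x...                    Up 8 months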

tl;dr A container on my master node (likely kube-scheduler, kube-controller-manager or both) died. The containers have come back up but are unable to communicate with the existing nodes - this is preventing any scheduled CronJobs or new deployments from being satisfied.

How can I re-configure the kubelet and associated services on the master node to communicate with the worker nodes again?

duncanhall
    In this case you could start by checking the log output of the kubelets, that would be my first guess. See also: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#looking-at-logs – Blokje5 Aug 11 '20 at 13:13
  • @Blokje5 Thanks - I've added quite a lot of new findings above - any further help you're able to give is much appreciated – duncanhall Aug 11 '20 at 15:33

1 Answer


From the docs on Troubleshoot Clusters:

Digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (note that on systemd-based systems, you may need to use journalctl instead)

Master Nodes

/var/log/kube-apiserver.log - API Server, responsible for serving the API

/var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions

/var/log/kube-controller-manager.log - Controller that manages replication controllers

Worker Nodes

/var/log/kubelet.log - Kubelet, responsible for running containers on the node

/var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
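A minimal sketch of pulling these on the master (the exact paths and unit names depend on how the cluster was provisioned, so adjust as needed):

    # Control-plane component logs written to files:
    sudo tail -n 200 /var/log/kube-scheduler.log /var/log/kube-controller-manager.log
    # kubelet on a systemd-based system:
    sudo journalctl -u kubelet --since "2020-08-05 03:00" --no-pager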

Another way to get logs is to use docker ps to get the container ID and then run docker logs containerid
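For example (a sketch; the grep just narrows it down to the suspect control-plane containers):

    sudo docker ps | grep -E 'kube-scheduler|kube-controller-manager'
    sudo docker logs --tail 100 <containerid>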

If you have a monitoring system set up using Prometheus and Grafana (which you should), you can check metrics such as high CPU load on the API server pods.
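A rough sketch of such a check, assuming a kube-prometheus-style setup (the Prometheus address and the job="apiserver" label are assumptions - adjust to your scrape config):

    # CPU usage of the API server process over the last 5 minutes
    curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
      --data-urlencode 'query=rate(process_cpu_seconds_total{job="apiserver"}[5m])'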

Arghya Sadhu