I am running workloads on a spot GPU node pool & intermittently getting 'NodeNotReady', followed by a reboot/restart of the node (& loss of the workload pod). The node does not go away, though - it reboots & the kubelet becomes ready again after a few minutes (see attached).
I am new to using the spot GPU node types, so I was wondering if this is to be expected?
If the underlying node is being preempted, how can I surface the termination event? https://cloud.google.com/compute/docs/instances/spot#preemption
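For context, this is roughly what I had in mind (a sketch only, not something I've tested on GKE): the Compute Engine docs linked above describe an `instance/preempted` metadata key, so something running on the node (or in a pod that can reach the metadata server) could poll it and react before shutdown. The polling loop itself is just illustrative:

```python
import time
import urllib.request

# instance/preempted is documented for Compute Engine spot/preemptible VMs;
# it returns "TRUE" once the instance has received a preemption notice.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def is_preempted() -> bool:
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode().strip() == "TRUE"


if __name__ == "__main__":
    while True:
        if is_preempted():
            print("Preemption notice received - checkpoint and shut down gracefully")
            break
        time.sleep(5)
```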
[EDIT]
After trawling through the logs, it looks like the underlying VM is pre-empted & immediately replaced with a new instance, while the k8s node identity remains the same (see attached).
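One quick way to confirm that the VM behind the node was actually replaced (while the Kubernetes Node object stays put) is to check the node's bootID, which changes whenever the underlying machine reboots or is recreated. A rough sketch with the Kubernetes Python client - the node name is just a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Placeholder node name - substitute a node from `kubectl get nodes`
node = v1.read_node("gke-cluster-spot-gpu-pool-0000")
info = node.status.node_info

print("bootID:    ", info.boot_id)          # changes when the VM reboots or is recreated
print("machineID: ", info.machine_id)
print("providerID:", node.spec.provider_id)  # stays the same if the replacement keeps the instance name
```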
So it looks like I answered my own question above. However, I am wondering how often I can expect these pre-emption events to occur? I have used the same spot instances outside of GKE (just as basic VMs) & didn't experience hourly pre-empting like this - in fact, I have run workloads there for days without a pre-emption event - perhaps it works differently for GKE?
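In case it helps anyone comparing frequency: my understanding (an assumption based on the Compute Engine preemption docs, not verified end-to-end on GKE) is that each preemption is recorded in the project's system event audit log as `compute.instances.preempted`, so they could be counted over a window with the google-cloud-logging client, e.g.:

```python
from datetime import datetime, timedelta, timezone
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
# Assumes preemptions show up as compute.instances.preempted system events
log_filter = (
    'protoPayload.methodName="compute.instances.preempted" '
    f'AND timestamp>="{since}"'
)

entries = list(client.list_entries(filter_=log_filter))
print(f"{len(entries)} preemption events in the last 7 days")
for entry in entries:
    print(entry.timestamp, entry.log_name)
```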