I am running workloads on a spot GPU node pool & intermittently getting 'NodeNotReady', followed by a reboot/restart of the node (& loss of the workload pod). The node does not go away, though - it reboots & the kubelet becomes ready again after a few minutes (see attached).
I am new to using the spot GPU node types, so I was wondering if this is to be expected?
If the underlying node is being preempted, how can I surface the termination event? https://cloud.google.com/compute/docs/instances/spot#preemption
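For context, this is roughly what I had in mind (a sketch only, not something I've tested on GKE): the Compute Engine docs linked above describe an `instance/preempted` metadata key, so something running on the node (or in a pod that can reach the metadata server) could poll it and react before shutdown. The polling loop itself is just illustrative:

```python
import time
import urllib.request

# instance/preempted is documented for Compute Engine spot/preemptible VMs;
# it returns "TRUE" once the instance has received a preemption notice.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def is_preempted() -> bool:
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode().strip() == "TRUE"


if __name__ == "__main__":
    while True:
        if is_preempted():
            print("Preemption notice received - checkpoint and shut down gracefully")
            break
        time.sleep(5)
```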
[EDIT]
After trawling through the logs, it looks like the underlying VM is pre-empted & immediately replaced with a new instance, while the k8s node identity remains the same (see attached).
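One quick way to confirm that the VM behind the node was actually replaced (while the Kubernetes Node object stays put) is to check the node's bootID, which changes whenever the underlying machine reboots or is recreated. A rough sketch with the Kubernetes Python client - the node name is just a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Placeholder node name - substitute a node from `kubectl get nodes`
node = v1.read_node("gke-cluster-spot-gpu-pool-0000")
info = node.status.node_info

print("bootID:    ", info.boot_id)          # changes when the VM reboots or is recreated
print("machineID: ", info.machine_id)
print("providerID:", node.spec.provider_id)  # stays the same if the replacement keeps the instance name
```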
So it looks like I answered my own question above. However, I am wondering how often I can expect these pre-emption events to occur? I have used the same spot instances outside of GKE (just as basic VMs) & didn't experience hourly pre-empting like this - in fact, I have run workloads there for days without a pre-emption event - perhaps it works differently for GKE?
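In case it helps anyone comparing frequency: my understanding (an assumption based on the Compute Engine preemption docs, not verified end-to-end on GKE) is that each preemption is recorded in the project's system event audit log as `compute.instances.preempted`, so they could be counted over a window with the google-cloud-logging client, e.g.:

```python
from datetime import datetime, timedelta, timezone
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
# Assumes preemptions show up as compute.instances.preempted system events
log_filter = (
    'protoPayload.methodName="compute.instances.preempted" '
    f'AND timestamp>="{since}"'
)

entries = list(client.list_entries(filter_=log_filter))
print(f"{len(entries)} preemption events in the last 7 days")
for entry in entries:
    print(entry.timestamp, entry.log_name)
```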