I am running a GPU-intensive workload on demand on GKE Standard, where I have created the appropriate node pool with a minimum of 0 and a maximum of 5 nodes. However, when a Job is scheduled on the node pool, GKE reports the following error:
Events:
Type     Reason             Age                From                Message
----     ------             ----               ----                -------
Warning  FailedScheduling   59s (x2 over 60s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
Normal   NotTriggerScaleUp  58s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 1 in backoff after failed scale-up
I have set up the nodeSelector according to the documentation, and autoscaling is enabled. Despite the "didn't match Pod's node affinity/selector" message, I can confirm the autoscaler does find the node pool and tries to scale it up, but it then fails shortly thereafter, claiming 0/1 nodes are available. That seems completely wrong, given that 0 of the node pool's 5 nodes are in use. What am I doing wrong here?
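For reference, here is a minimal sketch of the kind of Job spec I'm describing, set up the way the GKE GPU documentation prescribes. The name, image, and accelerator type below are placeholders rather than my exact values:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                 # placeholder name
spec:
  template:
    spec:
      nodeSelector:
        # GKE node label for GPU pools; the accelerator type is a placeholder
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: gpu-task          # placeholder
        image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # placeholder image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per Pod, as in the docs
      restartPolicy: Never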