I can use GKE Autopilot to run arbitrary workloads on a sandbox project (with default networks, default service account, default firewall rules) just fine.
But I need to create a GKE Autopilot cluster in an existing project that doesn't use the default settings for several things, such as networking, and when I try, the pods never run. My problem is identifying the underlying reason for the failure and which part of the project setup is preventing GKE Autopilot from working.
The error messages and logs are very scarce. The only things I see are:
- in the workloads UI, for my pod, it says "Pod unschedulable"
- in the pod UI, under events, it says "no nodes available to schedule pods" and "pod triggered scale-up: [{...url-of-an-instance-group...}]"
- under the cluster autoscaler logs, there is a "scale.up.error.waiting.for.instances.timeout" buried in a resultInfo log (with a reference to an instance group URL)
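
For reference, the autoscaler entry in the last bullet can be pulled up in Logs Explorer with a filter along these lines. This is only a rough sketch: the helper name and its parameters are made up for illustration, and the `jsonPayload` field path assumes the usual layout of cluster autoscaler visibility events, so it may need adjusting.

```python
def autoscaler_timeout_filter(project_id: str, cluster_name: str, location: str) -> str:
    """Build a Cloud Logging filter for scale-up timeout events.

    Hypothetical helper: the arguments are placeholders for the real
    project ID, cluster name, and cluster location (region or zone).
    """
    clauses = [
        # Autoscaler visibility events are logged against the cluster resource.
        'resource.type="k8s_cluster"',
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.location="{location}"',
        # Log that carries the autoscaler's decisions and resultInfo entries.
        f'logName="projects/{project_id}/logs/'
        'container.googleapis.com%2Fcluster-autoscaler-visibility"',
        # Assumed field path for the error ID seen in the resultInfo entry.
        'jsonPayload.resultInfo.results.errorMsg.messageId='
        '"scale.up.error.waiting.for.instances.timeout"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Placeholder values; paste the printed filter into Logs Explorer.
    print(autoscaler_timeout_filter("my-project", "my-cluster", "us-central1"))
```

The matching entries only confirm that the instances in the referenced instance group never joined the cluster in time; they do not say why.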
I can't find anything online about why scaling up would fail in Autopilot mode, which is supposed to be such a hands-off experience. I understand I'm not giving many details about the pod specification (any would fail!) or my project settings, but simply knowing where to look next would be helpful in my current situation.