GKE does not scale to/from 0 when autoscaling enabled

Question

I want to run a CronJob on my GKE cluster to perform a daily batch operation. Ideally, the cluster would scale to 0 nodes when the job is not running and scale up to 1 node to run the job whenever the schedule fires.

As a first step, I am trying this with a simple CronJob from the Kubernetes documentation that only prints the current time and terminates.

I first created a cluster with the following command:

gcloud container clusters create $CLUSTER_NAME \
    --enable-autoscaling \
    --min-nodes 0 --max-nodes 1 --num-nodes 1 \
    --zone $CLUSTER_ZONE

Then, I created a CronJob with the following description:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: Never

The job is scheduled to run every hour and to print the current time before terminating.

First, I wanted to create the cluster with 0 nodes, but setting --num-nodes 0 results in an error. Why is that? Note that I can manually scale the cluster down to 0 nodes after it has been created.

Second, if my cluster has 0 nodes, the job won't be scheduled because the cluster does not scale to 1 node automatically but instead gives the following error:

Cannot schedule pods: no nodes available to schedule pods.

Third, if my cluster has 1 node, the job runs normally, but afterwards the cluster does not scale down to 0 nodes and stays at 1 node instead. I let the cluster run for two successive jobs and it did not scale down in between. I assume one hour should be long enough for it to do so.

What am I missing?

EDIT: I've got it to work and detailed my solution here.

Robert Lacok · Accepted Answer · 2019-06-14T20:48:08.683

3

Update:

Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads.

https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

Old answer:

Scaling the entire cluster to 0 is not supported, because you always need at least one node for system pods:

See docs

You could create one node pool with a small machine for system pods, and an additional node pool with a big machine where you would run your workload. This way the second node pool can scale down to 0 and you still have space to run the system pods.
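A minimal sketch of adding such a second pool, assuming the cluster from the question already exists and its default pool keeps hosting the system pods. The pool name, machine type, and default zone are placeholder values, not taken from the original post. The helper only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: add an autoscaled node pool for the batch workload.
# All defaults below are hypothetical placeholders; adjust to your project.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
CLUSTER_ZONE="${CLUSTER_ZONE:-europe-west1-b}"
POOL_NAME="${POOL_NAME:-batch-pool}"
MACHINE_TYPE="${MACHINE_TYPE:-n1-highcpu-16}"

# Print the command by default; run it for real only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The pool may shrink to 0 nodes (--min-nodes 0) but starts with 1 node,
# since an initial size of 0 was reported not to scale up in this thread.
create_batch_pool() {
  run gcloud container node-pools create "$POOL_NAME" \
    --cluster "$CLUSTER_NAME" \
    --zone "$CLUSTER_ZONE" \
    --machine-type "$MACHINE_TYPE" \
    --enable-autoscaling --min-nodes 0 --max-nodes 1 \
    --num-nodes 1
}

create_batch_pool
```

Run it once as-is to inspect the command, then again with DRY_RUN=0 to create the pool.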

After trying this, @xEc notes: there are also scenarios in which the node pool would not scale, for example when the pool was created with an initial size of 0 instead of 1.

Initial suggestion:

Perhaps you could run a micro VM, with cron to scale the cluster up, submit a Job (instead of CronJob), wait for it to finish and then scale it back down to 0?
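A rough sketch of what that cron-driven script might look like, assuming kubectl is already pointed at the cluster and the Job lives in a local manifest. The job name, manifest path, timeout, and default cluster values are hypothetical. As above, commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: scale up, run a Job, wait for it, scale back down.
# All defaults below are hypothetical placeholders.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
CLUSTER_ZONE="${CLUSTER_ZONE:-europe-west1-b}"
JOB_NAME="${JOB_NAME:-hello}"
JOB_MANIFEST="${JOB_MANIFEST:-job.yaml}"
WAIT_TIMEOUT="${WAIT_TIMEOUT:-600s}"

# Print the command by default; run it for real only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_batch() {
  # Scale up to one node (add --node-pool if the cluster has several pools).
  run gcloud container clusters resize "$CLUSTER_NAME" \
    --zone "$CLUSTER_ZONE" --num-nodes 1 --quiet || return 1

  status=0
  # Remove any previous run so the Job is created fresh, then wait for it.
  run kubectl delete job "$JOB_NAME" --ignore-not-found &&
    run kubectl apply -f "$JOB_MANIFEST" &&
    run kubectl wait --for=condition=complete "job/$JOB_NAME" \
      --timeout="$WAIT_TIMEOUT" ||
    status=$?

  # Scale back down even when the Job failed or timed out.
  run gcloud container clusters resize "$CLUSTER_NAME" \
    --zone "$CLUSTER_ZONE" --num-nodes 0 --quiet || status=$?
  return "$status"
}

run_batch
```

The script could then be called from a crontab entry on the micro VM; its exit status reflects whether the Job completed within the timeout.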

edited Jun 14 '19 at 20:48

answered Aug 15 '18 at 09:08


Oh, I was not sure how to interpret that information from the documentation. I guess your solution is an option. That being said, how would you track the completion of the job on the cluster? – xEc Aug 15 '18 at 11:30

I agree it's not a particularly nice one. The information about a Job's completion is still available on the cluster (`kubectl get jobs`). Alternatively, you could try running Airflow to handle the scaling, the submission, and the completion tracking, though I have not had much good experience with it. What about adding a node pool with a very small machine, which would keep running, and letting the node pool with the large machine scale down to 0? – Robert Lacok Aug 15 '18 at 11:42
So you mean having a node pool with a small machine that runs constantly for the system pods and another node pool with the large machine that scales on demand? Would the large-machine node pool scale automatically from/to 0 nodes? I guess I'd have to add resource specifications to the CronJob for it to be scheduled on the correct node pool, right? – xEc Aug 15 '18 at 12:03
Yes, that would be the general idea to try out. And yes, by specifying CPU requests you could have your cron job scheduled on the big machine. – Robert Lacok Aug 15 '18 at 12:31
OK, I got this to work by doing what you suggested, namely running two different node pools. Do you want to post a new answer on this thread so that I can accept something cleaner than this one, which went on at length in the comments? If not, I will simply accept this one. Also note that there are scenarios in which my node pool wouldn't scale, like if I created the pool with an initial size of 0 instead of 1. I am not sure why... – xEc Aug 17 '18 at 08:25
@xEc I modified the original answer – Robert Lacok Aug 17 '18 at 08:40
"Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads." Source: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler – StockB Jun 14 '19 at 13:53
It has just been announced that CA/NAP will now work on an empty cluster, so it should scale up from an initial node count of 0: "GKE CA / NAP will now support working on empty clusters. Currently, CA/NAP is disabled when the cluster has zero nodes. With the new support, CA / NAP will continue to work even on empty clusters and scale up the cluster for any pending Kubernetes Pods." – Gábor Farkas Oct 15 '21 at 05:47
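The scheduling idea from the comments above (CPU requests so the CronJob lands on the large pool) could be sketched as follows. This is only an illustration: the pool name and CPU request are hypothetical, and the nodeSelector on GKE's `cloud.google.com/gke-nodepool` node label is an addition that was not discussed in the thread. The manifest keeps the apiVersion from the question; newer clusters serve CronJob under batch/v1.

```shell
#!/bin/sh
# Sketch: render the CronJob from the question with a CPU request and a
# node selector targeting the large pool. Defaults are hypothetical.
POOL_NAME="${POOL_NAME:-batch-pool}"
CPU_REQUEST="${CPU_REQUEST:-4}"

# Print the manifest to stdout so it can be reviewed or piped to kubectl.
render_cronjob() {
  cat <<EOF
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-nodepool: ${POOL_NAME}
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
            resources:
              requests:
                cpu: "${CPU_REQUEST}"
          restartPolicy: Never
EOF
}

render_cronjob
```

Piping the output to `kubectl apply -f -` would create the CronJob; a pending pod with these constraints is what gives the autoscaler a reason to add a node to that pool.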

score 0 · Answer 2 · answered Aug 15 '18 at 09:30

I do not think it's a good idea to tweak GKE for this kind of job. If you really need 0 instances I'd suggest you use either

App Engine Standard Environment, which allows you to scale instances to 0 (https://cloud.google.com/appengine/docs/standard/go/config/appref), or
Cloud Functions, which are 'instanceless'/serverless anyway. You can use this unofficial guide to trigger your Cloud Functions (https://cloud.google.com/community/tutorials/using-stackdriver-uptime-checks-for-scheduling-cloud-functions).

Sadly this won't work for me as I need a machine with a high number of vCPUs to take advantage of multiprocessing. — xEc, Aug 15 '18 at 11:25
