I have a rather expensive workload that some colleagues need to run occasionally on weekdays (not on any set schedule). I use Google Kubernetes Engine (GKE).
It consists of three statefulsets, each with one replica.
I've instructed them how to turn it "on" and "off." To turn it "on," they scale each statefulset to 1 replica. To turn it "off," they scale each statefulset to 0 replicas.
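For reference, turning it "on" and "off" is just one scale command per statefulset (the statefulset names here are placeholders, not the real ones):

# "on": bring each statefulset up to one replica
kubectl scale statefulset app-a app-b app-c --replicas=1

# "off": scale each one back down to zero
kubectl scale statefulset app-a app-b app-c --replicas=0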
Originally, I had a single autoscaling node pool with a default size of three nodes (the statefulsets each consume almost an entire node's worth of CPU and RAM). I observed that even an hour or two after scaling everything down to 0, at least one node (and sometimes two) would remain. I was expecting that eventually all the nodes would go away, but that doesn't happen.
I noticed that the running nodes still had some pods, just in a different namespace. The remaining pods are all in the kube-system namespace, except for one in the custom-metrics namespace.
So then I thought, okay - maybe there are other services Kubernetes wants to run even when there are no user-defined workloads/pods. So I created another node pool, with a single very-small-but-adequate node. That node is big enough to run everything that Kubernetes reports is running in those non-default namespaces.
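Roughly, the extra pool was created like this (pool name, machine type, and cluster name are illustrative, not the exact values I used):

gcloud container node-pools create system-pool \
    --cluster=my-cluster \
    --machine-type=e2-small \
    --num-nodes=1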
After the new node pool was running with one node, I then proceeded to manually resize the original node pool to 0. It was fine. I hoped at this point that I had a "system" node pool for running kube-system and other stuff, and a "user" node pool for running my own stuff.
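The manual resize was just a single command (cluster and pool names are placeholders here):

gcloud container clusters resize my-cluster \
    --node-pool=pool-1 \
    --num-nodes=0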
So for my next test, this time I only scaled up one statefulset replica. Eventually a node came online and the statefulset pod was running/ready. I then scaled it down to 0 again and waited... and waited... and the node did not go away.
What does it take to make the autoscaling node pool actually reach 0 nodes? Clearly I am missing something (or more than something), but I have had a hard time finding information about what is necessary for the cluster autoscaler to scale a node pool down to 0.
Any advice is appreciated.
Additional info
When I look at what's running on the node in the node pool I want to go to 0, here's what I see:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentd-gcp-v3.1.1-mfkxf 100m (0%) 1 (3%) 200Mi (0%) 500Mi (0%) 28m
kube-system kube-proxy-gke-tileperformance-pool-1-14d3671d-jl76 100m (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system prometheus-to-sd-htvnw 1m (0%) 3m (0%) 20Mi (0%) 20Mi (0%) 28m
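(That listing is the "Non-terminated Pods" section of kubectl describe node; the node name can be read off the kube-proxy pod name above.)

kubectl describe node gke-tileperformance-pool-1-14d3671d-jl76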
If I try to drain the node, it complains that the remaining pods are managed via a DaemonSet, so I could force it, but obviously I am trying not to have to manually intervene in any way.
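For completeness, this is roughly what that looks like (error text is approximate); --ignore-daemonsets would let the drain proceed, but that is exactly the kind of manual step I want to avoid:

kubectl drain gke-tileperformance-pool-1-14d3671d-jl76
# error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore):
#   kube-system/fluentd-gcp-v3.1.1-mfkxf, kube-system/prometheus-to-sd-htvnw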
Hack
To get the autoscaler to "work" and downsize to 0, I've temporarily added a nodeSelector to all the kube-system deployments so they are assigned to a separate pool for kube-system stuff. But there has to be a better way, right?
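Concretely, the "hack" is a patch like the following on each kube-system deployment. The deployment name and node-pool value here are illustrative; cloud.google.com/gke-nodepool is the label GKE puts on every node with its pool name:

kubectl -n kube-system patch deployment kube-dns --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"system-pool"}}}}}'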