I have a rather expensive workload that some colleagues need to run occasionally on weekdays (not on any set schedule). I use Google Kubernetes Engine (GKE).
It consists of three statefulsets, each with one replica.
I've instructed them how to turn it "on" and "off." To turn it "on," they scale each statefulset to 1 replica. To turn it "off," they scale each statefulset to 0 replicas.
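For reference, turning it "on" and "off" is just one scale command per statefulset (the statefulset names here are placeholders, not the real ones):

# "on": bring each statefulset up to one replica
kubectl scale statefulset app-a app-b app-c --replicas=1

# "off": scale each one back down to zero
kubectl scale statefulset app-a app-b app-c --replicas=0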
Originally, I had a single autoscaling node pool with a default size of three nodes (the statefulsets each consume almost an entire node's worth of CPU and RAM). I observed that even an hour or two after scaling everything down to 0, at least one node (and sometimes two) would remain. I was expecting that eventually all the nodes would go away, but that doesn't happen.
I noticed that the running nodes still had some pods, just in a different namespace. The remaining pods are all in the kube-system namespace, except for one in the custom-metrics namespace.
So then I thought, okay - maybe there are other services Kubernetes wants to run even when there are no user-defined workloads/pods. So I created another node pool, with a single very-small-but-adequate node. That node is big enough to run everything that Kubernetes reports is running in those non-default namespaces.
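Roughly, the extra pool was created like this (pool name, machine type, and cluster name are illustrative, not the exact values I used):

gcloud container node-pools create system-pool \
    --cluster=my-cluster \
    --machine-type=e2-small \
    --num-nodes=1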
After the new node pool was running with one node, I then proceeded to manually resize the original node pool to 0. It was fine. I hoped at this point that I had a "system" node pool for running kube-system and other stuff, and a "user" node pool for running my own stuff.
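The manual resize was just a single command (cluster and pool names are placeholders here):

gcloud container clusters resize my-cluster \
    --node-pool=pool-1 \
    --num-nodes=0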
So for my next test, this time I only scaled up one statefulset replica. Eventually a node came online and the statefulset pod was running/ready. I then scaled it down to 0 again and waited... and waited... and the node did not go away.
What does it take to make the autoscaling node pool actually reach 0 nodes? Clearly I am missing something (or more than something), but I have had a hard time finding information about what is necessary for the cluster autoscaler to scale a node pool down to 0.
Any advice is appreciated.
Additional info
When I look at what's running on the node in the node pool I want to go to 0, here's what I see:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentd-gcp-v3.1.1-mfkxf 100m (0%) 1 (3%) 200Mi (0%) 500Mi (0%) 28m
kube-system kube-proxy-gke-tileperformance-pool-1-14d3671d-jl76 100m (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system prometheus-to-sd-htvnw 1m (0%) 3m (0%) 20Mi (0%) 20Mi (0%) 28m
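(That listing is the "Non-terminated Pods" section of kubectl describe node; the node name can be read off the kube-proxy pod name above.)

kubectl describe node gke-tileperformance-pool-1-14d3671d-jl76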
If I try to drain the node, it complains that the remaining pods are managed via a DaemonSet, so I could force it, but obviously I am trying not to have to manually intervene in any way.
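For completeness, this is roughly what that looks like (error text is approximate); --ignore-daemonsets would let the drain proceed, but that is exactly the kind of manual step I want to avoid:

kubectl drain gke-tileperformance-pool-1-14d3671d-jl76
# error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore):
#   kube-system/fluentd-gcp-v3.1.1-mfkxf, kube-system/prometheus-to-sd-htvnw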
Hack
To get the autoscaler to "work" and downsize to 0, I've temporarily added a nodeSelector to all the kube-system deployments so they are assigned to a separate pool for kube-system stuff. But there has to be a better way, right?
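Concretely, the "hack" is a patch like the following on each kube-system deployment. The deployment name and node-pool value here are illustrative; cloud.google.com/gke-nodepool is the label GKE puts on every node with its pool name:

kubectl -n kube-system patch deployment kube-dns --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"system-pool"}}}}}'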