GKE managed Kubernetes 1.22.12-gke.500, with the Calico CNI v3.21.5-gke.1 provided by Google, which includes calico-node-vertical-autoscaler (cpvpa:v0.8.3-gke.1): https://github.com/kubernetes-sigs/cluster-proportional-vertical-autoscaler/tree/v0.8.3
Default config provided by GKE for the VPA:
{
  "calico-node": {
    "requests": {
      "cpu": {
        "base": "80m",
        "step": "20m",
        "nodesPerStep": 10,
        "max": "500m"
      }
    }
  }
}
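For context, here is a minimal sketch of how the cpvpa "linear" sizing appears to work with this config: the request grows by one `step` for every `nodesPerStep` nodes, starting from `base` and capped at `max`. The exact rounding (integer division vs. something else) is an assumption on my part; check the cpvpa source for the precise behavior. The point is that crossing a step boundary (e.g. node 39 to 40) bumps the request and rolls the whole DaemonSet:

```python
def cpvpa_linear_cpu_m(nodes, base_m=80, step_m=20, nodes_per_step=10, max_m=500):
    """Approximate cpvpa's step function for the calico-node CPU request.

    Assumption: one `step` is added per `nodesPerStep` nodes, using
    integer division, capped at `max`. Values are in millicores.
    """
    request = base_m + step_m * (nodes // nodes_per_step)
    return min(request, max_m)

# Under these assumptions, a 30-node cluster gets 140m and a 40-node
# cluster gets 160m, so scaling between 30 and 40 nodes crosses a step
# boundary and restarts every calico-node pod for a 20m change.
print(cpvpa_linear_cpu_m(30))  # 140
print(cpvpa_linear_cpu_m(40))  # 160
```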
Anyway, what I'm finding is that during PEAK TRAFFIC, some services with no CPU limit burst way above 100% of their CPU request. At the same time, cluster-autoscaler activity scales out new worker nodes (to accommodate HPA-triggered workload replicas), and this VPA will 'intelligently' bump the calico-node CPU request A TINY AMOUNT, which still causes every single DaemonSet pod to be restarted with the new config.
Sometimes the restarting DS pod cannot actually FIT ON the node again,
see: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/Calico-node-cant-start-up-due-to-not-enough-resources-on-node/m-p/455158/highlight/true
My cluster node count will not vary THAT much, maybe 30 to 40 nodes max.
- Why would Google recommend VPA on something as critical as calico-node, and risk network disruption from failed and delayed calico-node restarts?
- Should I just disable this VPA, set a fixed CPU request based on my cluster's traffic history, and revisit it if my cluster ever scales to, say, 50-60 nodes?
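If I go that route, one idea (rather than removing the autoscaler outright) might be to flatten the step function so the request never changes in my 30-40 node range, e.g. a config along these lines. The "200m" base is a placeholder to be picked from actual traffic history, and the huge nodesPerStep means the request stays at base until the cluster grows far beyond its current size. Caveat: GKE's addon manager may reconcile managed ConfigMaps back to defaults, so this would need verifying that the change actually sticks:

```
{
  "calico-node": {
    "requests": {
      "cpu": {
        "base": "200m",
        "step": "20m",
        "nodesPerStep": 1000,
        "max": "500m"
      }
    }
  }
}
```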