
I'm running GKE managed Kubernetes 1.22.12-gke.500 with the Google-provided Calico CNI v3.21.5-gke.1,
which includes calico-node-vertical-autoscaler (cpvpa:v0.8.3-gke.1): https://github.com/kubernetes-sigs/cluster-proportional-vertical-autoscaler/tree/v0.8.3
Default config provided by GKE for this vertical autoscaler (called VPA below):

{
  "calico-node": {

    "requests": {
      "cpu": {
        "base": "80m",
        "step": "20m",
        "nodesPerStep": 10,
        "max": "500m"
      }
    }
  }
}

What I'm finding is that during peak traffic some services without CPU limits burst well above 100% of their CPU request. At the same time, cluster-autoscaler scales out new worker nodes (to accommodate HPA-triggered workload replicas), and this VPA 'intelligently' modifies the calico-node CPU request by a tiny amount. Even that small change causes every single DaemonSet pod to be restarted with the new config.
Sometimes a restarting DaemonSet pod can no longer fit on its node,
see: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/Calico-node-cant-start-up-due-to-not-enough-resources-on-node/m-p/455158/highlight/true

My cluster node count does not vary much: roughly 30 to 40 nodes at most.
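To make the "tiny amount" concrete, here is a rough sketch of what the default config above would yield across that node range. It assumes the autoscaler computes the request as base + step * ceil(nodes / nodesPerStep), capped at max; that formula is my reading of how cpvpa works and is not confirmed anywhere in this post. The function name is made up for illustration.

```python
import math

# Default cpvpa CPU-request settings for calico-node, copied from the
# config in the question. Values are in millicores.
CONFIG = {"base": 80, "step": 20, "nodesPerStep": 10, "max": 500}


def cpu_request_millicores(nodes: int, cfg: dict = CONFIG) -> int:
    """Estimate the calico-node CPU request for a given node count.

    ASSUMPTION: steps = ceil(nodes / nodesPerStep). This is an unverified
    reading of the autoscaler's behaviour, used only for illustration.
    """
    steps = math.ceil(nodes / cfg["nodesPerStep"])
    return min(cfg["base"] + cfg["step"] * steps, cfg["max"])


if __name__ == "__main__":
    previous = None
    for nodes in range(30, 41):
        request = cpu_request_millicores(nodes)
        changed = previous is not None and request != previous
        note = "  <- request changes, DaemonSet pods restart" if changed else ""
        print(f"{nodes} nodes -> {request}m{note}")
        previous = request
```

Under that assumption the request is 140m at 30 nodes and 160m from 31 to 40 nodes, so a scale-out from 30 to 31 nodes changes the request by only 20m yet still rolls every calico-node pod.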

  1. Why would Google recommend a VPA on something as critical as calico-node and risk network disruption from failed or delayed calico-node restarts?
  2. Should I just disable this VPA, set a max CPU request based on my cluster traffic history, and revisit it if my cluster ever scales to, say, 50 or 60 nodes?
siwasaki
  • Side question: is there a reason you are not using Dataplane V2 (which does not require Calico for network policy)? – Gari Singh Sep 11 '22 at 10:12
  • You should be able to modify the ConfigMap, changing the settings for calico-node. Calico definitely requires more CPU as the number of nodes increases, so you'd probably want to increase the "base" and then perhaps change "nodesPerStep" to a value higher than the number of nodes you expect to have in your cluster. – Gari Singh Sep 11 '22 at 10:16
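
A minimal sketch of the adjustment suggested in the comment above: raise "base" and set "nodesPerStep" above the expected node count so the computed request stays flat. The values 160m and 100 are hypothetical placeholders, not recommendations from GKE or Calico, and the step formula (ceil(nodes / nodesPerStep)) is an assumption about how the autoscaler behaves.

```python
import json
import math

EXPECTED_MAX_NODES = 40  # from the question: roughly 30 to 40 nodes

# HYPOTHETICAL tuned settings: a higher base and a nodesPerStep larger than
# the expected node count. Pick real values from your own traffic history.
tuned = {
    "calico-node": {
        "requests": {
            "cpu": {
                "base": "160m",
                "step": "20m",
                "nodesPerStep": 100,
                "max": "500m",
            }
        }
    }
}


def millicores(quantity: str) -> int:
    """Parse a millicore quantity such as '160m' into an integer."""
    return int(quantity.rstrip("m"))


def cpu_request_millicores(nodes: int, cpu: dict) -> int:
    # ASSUMPTION: steps = ceil(nodes / nodesPerStep); unverified.
    steps = math.ceil(nodes / cpu["nodesPerStep"])
    value = millicores(cpu["base"]) + millicores(cpu["step"]) * steps
    return min(value, millicores(cpu["max"]))


cpu = tuned["calico-node"]["requests"]["cpu"]
# Collect every distinct request the config would produce up to the max size.
requests = {cpu_request_millicores(n, cpu) for n in range(1, EXPECTED_MAX_NODES + 1)}

print(json.dumps(tuned, indent=2))  # JSON in the same shape as the default config
print("distinct requests (millicores):", sorted(requests))
```

If the set contains a single value, scaling anywhere within the expected range would not change the request, so node scale-out alone should not trigger a calico-node rollout.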
