
I have implemented HPA for all the pods based on CPU and it was working as expected. But when we did maintenance on the worker nodes, the HPAs got messed up, as they failed to recognize it. Do I need to disable HPA temporarily during maintenance and bring it back up once the maintenance is over?

Please suggest

HPA Manifest -

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: pod-name-cpu
spec:
  maxReplicas: 6
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pod-name
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  • Can you share your HPA manifest and how did you perform maintenance? Did you cordon the nodes? – kool Feb 18 '21 at 16:27
  • HPA Manifest added in the original question. Yes the node is cordoned – dvlpr Feb 18 '21 at 18:12
  • What exactly do you mean by "maintenance"? Was an update strategy for deployments used? What was expected and how did HPA react? Please elaborate more. – Wytrzymały Wiktor Mar 03 '21 at 14:21

1 Answer


There is a maintenance-mode solution described in the Kubernetes documentation, which says:

You can implicitly deactivate the HPA for a target without the need to change the HPA configuration itself. If the target's desired replica count is set to 0, and the HPA's minimum replica count is greater than 0, the HPA stops adjusting the target (and sets the ScalingActive Condition on itself to false) until you reactivate it by manually adjusting the target's desired replica count or HPA's minimum replica count.

EDIT:

To expand on the above, the steps you should take are (see the command sketch after this list):

  • Scale your deployment to 0

  • Describe your HPA

  • Notice that under the Conditions: section, ScalingActive will turn to False, which disables the HPA until you set the replicas back to the desired value

  • See more here
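
As a rough sketch of these steps, assuming the Deployment and HPA names from the manifest in the question (pod-name and pod-name-cpu):

# Scale the target Deployment to 0 before maintenance; the HPA stops adjusting it
kubectl scale deployment pod-name --replicas=0

# Check the Conditions: section; ScalingActive should now report False
kubectl describe hpa pod-name-cpu

# After maintenance, restore the replica count; the HPA takes over again
kubectl scale deployment pod-name --replicas=2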

Also, since you did not specify what exactly happened or what the desired outcome is, you might also consider moving your workload to a different node. How to perform Disruptive Actions on your Cluster has a few options for you to choose from.
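
For example, a typical flow for moving workloads off a node during maintenance (a sketch, assuming a node named worker-1) looks like:

# Mark the node unschedulable, evict its pods, do the maintenance, then re-enable scheduling
kubectl cordon worker-1
kubectl drain worker-1 --ignore-daemonsets
# ... perform maintenance on worker-1 ...
kubectl uncordon worker-1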
