
I'm trying to scale a Kubernetes service using a HorizontalPodAutoscaler and an external metric from SQS. I see there are two separate metrics: ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible.

Using the number of messages visible causes processing pods to be targeted for termination immediately after they pick up a message from the queue, since the message is then no longer visible. If I use the number of messages not visible instead, the HPA will not scale up.

Here is the ExternalMetric template:

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: metric-name
spec:
  name: metric-name
  queries:
    - id: metric_name
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "ApproximateNumberOfMessagesVisible"
          dimensions:
            - name: QueueName
              value: "queue_name"
        period: 60
        stat: Average
        unit: Count
      returnData: true

Here is the HPA template:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: hpa-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: deployment-name
  minReplicas: 1
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metricName: metric-name
      targetAverageValue: 1

The problem would be solved if I could define another custom metric that is the sum of these two metrics. How else can I solve this problem?

Chakradar Raju
  • This looks like an interesting use case. Looking at your HPA definition, the rule starts scaling out when the number of messages visible in your queue exceeds 1, and scales in when there are fewer than 1 message(s) in the queue. So with the HPA keeping an eye on `ApproximateNumberOfMessagesVisible`, isn't it alright if the processing pods terminate when there are no more messages visible? Or do you see the HPA instructing pods to scale down even when there were messages visible? – rock'n rolla Mar 19 '21 at 13:24
  • Let's say there are 20 messages being processed and 20 pods running; there will be 0 left visible, so the HPA will try to terminate all of them. Even if the terminating state can be ignored, the problem is that when another 20 messages come into the queue, the HPA doesn't scale up because there are already 20 running pods. Because of this non-ideal metric, processing latency is increasing. – Chakradar Raju Mar 21 '21 at 14:22

2 Answers


We used a Lambda to fetch the two metrics and publish a custom metric that is the sum of messages in flight and messages waiting. You can trigger this Lambda from a CloudWatch Events rule at whatever frequency you want: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#rules:action=create

Here is the Lambda code for reference:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({region: ''});  // fill region here
const sqs = new AWS.SQS();

const SQS_URL = ''; // fill queue url here

async function getSqsMetric(queueUrl) {
    var params = {
      QueueUrl: queueUrl,
      AttributeNames: ['All']
    };
    return new Promise((res, rej) => {
        sqs.getQueueAttributes(params, function(err, data) {
            if (err) rej(err);
            else res(data);
        });
    })
    
}

function buildMetric(numMessages) {
  return {
    Namespace: 'yourcompany-custom-metrics',
    MetricData: [{
      MetricName: 'mymetric',
      Dimensions: [{
          Name: 'env',
          Value: 'prod'
      }],
      Timestamp: new Date(),
      Unit: 'Count',
      Value: numMessages
    }]
  }
}

async function pushMetrics(metrics) {
  await new Promise((res) => cloudwatch.putMetricData(metrics, (err, data) => {
    if (err) {
      console.log('err', err, err.stack); // an error occurred
      res(err);
    } else {
      console.log('response', data);           // successful response
      res(data);
    }
  }));
}

exports.handler = async (event) => {
    console.log('Started');
    const sqsMetrics = await getSqsMetric(SQS_URL).catch(console.error);
    var queueSize = null;
    if (sqsMetrics) {
        console.log('Got sqsMetrics', sqsMetrics);
        if (sqsMetrics.Attributes) {
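          // ApproximateNumberOfMessages = visible (waiting) messages;
          // ApproximateNumberOfMessagesNotVisible = in-flight messages.
          // Their sum is the total backlog the HPA should scale on.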
          queueSize = parseInt(sqsMetrics.Attributes.ApproximateNumberOfMessages) + parseInt(sqsMetrics.Attributes.ApproximateNumberOfMessagesNotVisible);
          console.log('Pushing', queueSize);
          await pushMetrics(buildMetric(queueSize))
        } 
    } else {
        console.log('Failed fetching sqsMetrics');
    }
    const response = {
        statusCode: 200,
        body: JSON.stringify('Pushed ' + queueSize),
    };
    return response;
};
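
For completeness, here is a minimal sketch of how an ExternalMetric could then point at this custom metric instead of the raw SQS metric, assuming the same namespace (yourcompany-custom-metrics), metric name (mymetric), and env=prod dimension used in the Lambda above; the resource name queue-backlog is just a placeholder:

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: queue-backlog
spec:
  name: queue-backlog
  queries:
    - id: queue_backlog
      metricStat:
        metric:
          namespace: "yourcompany-custom-metrics"
          metricName: "mymetric"
          dimensions:
            - name: env
              value: "prod"
        period: 60
        stat: Average
        unit: Count
      returnData: true

The HPA itself then only needs its external metricName switched to this new metric.
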
Chakradar Raju

This seems to be a case of thrashing - the number of replicas keeps fluctuating frequently because of the dynamic nature of the metric being evaluated.

IMHO, you've got a couple of options here. You could look at adding a stabilization window to your HPA and probably also limit the scale-down rate. You'd have to try a few combinations of metrics and settings to see what works best, as you'd best know the nature of the metric (ApproximateNumberOfMessagesVisible in this case) you see in your infrastructure.
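
As a rough illustration, assuming Kubernetes 1.18+ where the autoscaling/v2beta2 API and the behavior field are available, the scale-down side could be tuned along these lines (the window and policy values are placeholders you'd tune for your workload):

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta2
metadata:
  name: hpa-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-name
  minReplicas: 1
  maxReplicas: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of consistently low readings before scaling in
      policies:
      - type: Pods
        value: 5                        # remove at most 5 pods per minute
        periodSeconds: 60
  metrics:
  - type: External
    external:
      metric:
        name: metric-name
      target:
        type: AverageValue
        averageValue: "1"

Slowing scale-in like this makes it less likely that pods holding in-flight messages are terminated the moment the visible-message count drops.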

rock'n rolla