
I'm trying to set up an autoscaling Fargate cluster for GitHub self-hosted runners. The high-level design looks like this:

  1. A GitHub App will send a webhook event to a Lambda behind an API Gateway.
  2. The Lambda will put a custom COUNT metric with value 1 if the request is for a new workflow, and -1 for a completed or cancelled workflow (a sketch of this handler is shown after the list). The metric will include the repo owner (REPO_OWNER), the repo name (REPO_NAME), the event type (EVENT_TYPE, which I know will always be workflow_job) and the workflow run ID (ID) as dimensions.
  3. Two Application Auto Scaling policies (up and down) will change the ecs:service:DesiredCount scalable dimension based on the value of the custom metric.
  4. Two CloudWatch metric alarms (up and down) will trigger the above two policies whenever the scaling thresholds are breached.
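
To make step 2 concrete, here is a minimal sketch of the metric-publishing part of that Lambda, assuming the AWS SDK v3 CloudWatch client; the function name, namespace value, and payload fields are placeholders rather than the real handler code:

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch'

const cloudwatch = new CloudWatchClient({})

// +1 when a workflow job is queued, -1 when it completes or is cancelled.
// Every distinct combination of dimensions becomes its own metric in CloudWatch.
async function publishRunnerDemand(action: string, repoOwner: string, repoName: string, runId: string): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'GithubRunners',      // stands in for options.customCloudWatchMetricNamespace
    MetricData: [{
      MetricName: 'COUNT',           // stands in for options.customCloudWatchMetricName
      Value: action === 'queued' ? 1 : -1,
      Unit: 'Count',
      Dimensions: [
        { Name: 'REPO_OWNER', Value: repoOwner },
        { Name: 'REPO_NAME', Value: repoName },
        { Name: 'EVENT_TYPE', Value: 'workflow_job' },
        { Name: 'ID', Value: runId },
      ],
    }],
  }))
}

The CDKTF code for the scaling target, the two policies, and the two alarms looks like this:
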
const autoscalingTarget = new AppautoscalingTarget(this, `appautoscaling-target-${environment}`, {
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  minCapacity: 0,
  maxCapacity: options.maxClusterSize,
})

const scaleUpPolicy = new AppautoscalingPolicy(this, `autoscale-up-policy-${environment}`, {
  dependsOn: [autoscalingTarget],
  name: `autoscale-up-policy-${environment}`,
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  stepScalingPolicyConfiguration: {
    adjustmentType: 'ChangeInCapacity',
    cooldown: 30,
    metricAggregationType: 'Maximum',
    stepAdjustment: [{
      // applies when (metric value - alarm threshold) >= 1
      metricIntervalLowerBound: '1',
      scalingAdjustment: 1,
    }]
  },
})

const scaleDownPolicy = new AppautoscalingPolicy(this, `autoscale-down-policy-${environment}`, {
  dependsOn: [autoscalingTarget],
  name: `autoscale-down-policy-${environment}`,
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  stepScalingPolicyConfiguration: {
    adjustmentType: 'ChangeInCapacity',
    cooldown: 30,
    metricAggregationType: 'Maximum',
    stepAdjustment: [{
      // applies when (metric value - alarm threshold) <= 0
      metricIntervalUpperBound: '0',
      scalingAdjustment: -1,
    }]
  }
})

const alarmPeriod = 120 as const

new CloudwatchMetricAlarm(this, `autoscale-up-alarm-${environment}`, {
  alarmName: `fargate-cluster-scale-up-alarm-${environment}`,
  // metricName/namespace are configured per query inside metricQuery below;
  // setting them at the top level conflicts with metricQuery in aws_cloudwatch_metric_alarm
  alarmDescription: `Scales up the Fargate cluster based on the ${options.customCloudWatchMetricNamespace}.${options.customCloudWatchMetricName} metric`,
  comparisonOperator: 'GreaterThanThreshold',
  threshold: 0,
  evaluationPeriods: 1,
  metricQuery: [{
    id: 'm1',
    metric: {
      metricName: options.customCloudWatchMetricName,
      namespace: options.customCloudWatchMetricNamespace,
      period: alarmPeriod,
      stat: 'Sum',
      unit: 'Count',
      dimensions: {
        // Note: this is the only dimension I can know in advance
        EVENT_TYPE: 'workflow_job',
      },
    },
  }, {
    id: 'm2',
    metric: {
      metricName: options.customCloudWatchMetricName,
      namespace: options.customCloudWatchMetricNamespace,
      period: alarmPeriod,
      stat: 'Sum',
      unit: 'Count',
      dimensions: {
        // Note: this is the only dimension I can know in advance
        EVENT_TYPE: 'workflow_job',
      },
    },
  }, {
    id: 'e1',
    expression: 'SUM(METRICS())',
    label: 'Sum of Actions Runner Requests',
    returnData: true,
  }],
  alarmActions: [
    scaleUpPolicy.arn,
  ],
  actionsEnabled: true,
})

new CloudwatchMetricAlarm(this, `autoscale-down-alarm-${environment}`, {
  alarmName: `fargate-cluster-scale-down-alarm-${environment}`,
  alarmDescription: `Scales down the Fargate cluster based on the ${options.customCloudWatchMetricNamespace}.${options.customCloudWatchMetricName} metric`,
  comparisonOperator: 'LessThanThreshold',
  threshold: 1,
  // period is set inside each metric query below (a top-level period conflicts with metricQuery)
  evaluationPeriods: 1,
  metricQuery: [{
    id: 'm1',
    metric: {
      metricName: options.customCloudWatchMetricName,
      namespace: options.customCloudWatchMetricNamespace,
      period: alarmPeriod,
      stat: 'Sum',
      unit: 'Count',
      dimensions: {
        // Note: this is the only dimension I can know in advance
        EVENT_TYPE: 'workflow_job',
      },
    },
  }, {
    id: 'm2',
    metric: {
      metricName: options.customCloudWatchMetricName,
      namespace: options.customCloudWatchMetricNamespace,
      period: alarmPeriod,
      stat: 'Sum',
      unit: 'Count',
      dimensions: {
        // Note: this is the only dimension I can know in advance
        EVENT_TYPE: 'workflow_job',
      },
    },
  }, {
    id: 'e1',
    expression: 'SUM(METRICS())',
    label: 'Sum of Actions Runner Requests',
    returnData: true,
  }],
  alarmActions: [
    scaleDownPolicy.arn,
  ],
  actionsEnabled: true,
})

I do not see the metrics showing any data, nor the alarms changing state, until I add all 4 dimensions to the alarms' metric queries. Querying by only 1 dimension (EVENT_TYPE, the only static one) gives me no data, but adding all 4 does.

How do I model my metrics so I can continue adding more dynamic metadata as dimensions but still set up working alarms based on well-known static dimensions?

  • If you can see the metrics in CloudWatch, but the alarms are in "Insufficient data" then you missed some setting on the alarm that is preventing it from actually pulling in the metric. Whenever I run into this sort of thing I create the same alarm manually in the CloudWatch web console, and then compare that to the one Terraform created to see what the difference is. It's usually something I missed in the `dimensions` block of the alarm. – Mark B Jun 08 '22 at 11:56
  • @MarkB You were right, the `dimensions` were the problem. If I add all the dimensions, I can see that the alarms change state. However, my problem is that most of the dimensions are dynamic except for 1. If I query only by that single static dimension, I don't see the data again. How would you recommend I solve this? – GPX Jun 08 '22 at 13:50
  • I think you'll need to provide some actual concrete info on what dimensions you are needing to track, and which ones are dynamic, in order for anyone to help. – Mark B Jun 08 '22 at 13:57
  • @MarkB I've updated the original post with additional details. – GPX Jun 08 '22 at 14:11
  • You're adding too many custom dimensions to your custom metrics. Each combination of dimensions is a new metric. So you are creating N number of custom CloudWatch metrics (the number of your dynamic values), but you want to auto-scale based on only one of those dimensions. You could create another custom metric from your Lambda functions, that only has the one static dimension, or you could possibly do some sort of CloudWatch metric math to combine the metrics (I'm not sure that will work), or you could remove the dynamic dimensions from your current metrics to combine them. – Mark B Jun 08 '22 at 14:20
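
A rough sketch of the first suggestion in that last comment (publish a second datapoint that carries only the static EVENT_TYPE dimension, so the alarms have one well-known series to query), reusing the client and placeholder names from the sketch earlier in the question:

// Hypothetical variant of the earlier publishRunnerDemand sketch: publish each
// value twice -- once with the full dynamic dimension set (per-repo/per-run
// visibility) and once with only the static EVENT_TYPE dimension, which gives
// the alarms a single well-known series to query.
async function publishRunnerDemandTwice(value: number, repoOwner: string, repoName: string, runId: string): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'GithubRunners',
    MetricData: [
      {
        MetricName: 'COUNT',
        Value: value,
        Unit: 'Count',
        Dimensions: [
          { Name: 'REPO_OWNER', Value: repoOwner },
          { Name: 'REPO_NAME', Value: repoName },
          { Name: 'EVENT_TYPE', Value: 'workflow_job' },
          { Name: 'ID', Value: runId },
        ],
      },
      {
        MetricName: 'COUNT',
        Value: value,
        Unit: 'Count',
        Dimensions: [{ Name: 'EVENT_TYPE', Value: 'workflow_job' }],
      },
    ],
  }))
}

The per-repo/per-run series stay available for dashboards, while the alarms only need the single-dimension series.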

1 Answer


I was able to solve this by removing all dimensions from the CloudWatch metrics.
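
Concretely, that means the Lambda publishes the datapoints with no Dimensions list at all, and the alarms query that single dimensionless series. A minimal sketch of both sides, modelled on the code in the question rather than the exact final resources:

// Lambda side: no Dimensions at all, so every datapoint lands in one
// well-known series that the alarms can query.
async function publishRunnerDemand(value: number): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'GithubRunners',
    MetricData: [{ MetricName: 'COUNT', Value: value, Unit: 'Count' }],
  }))
}

// Alarm side: the metric query simply drops the dimensions block
metricQuery: [{
  id: 'm1',
  metric: {
    metricName: options.customCloudWatchMetricName,
    namespace: options.customCloudWatchMetricNamespace,
    period: alarmPeriod,
    stat: 'Sum',
    unit: 'Count',
  },
  returnData: true,
}],

As noted in the comments, each extra dimension combination creates a separate custom metric, which is why the alarms saw no data when they queried by EVENT_TYPE alone.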
