We have an AWS ECS task scheduled with CloudWatch Events / EventBridge. We'd like metrics and notifications for failed runs, based on the container exit code.

We were planning to use the FailedInvocations metric described in "Monitoring Usage with CloudWatch Metrics".

However, a non-zero task exit code does not seem to show up in the metrics. The exit code of the ECS task is verified to be non-zero in the AWS console, but the metrics include only "Invocations" and "TriggeredRules". We did see FailedInvocations earlier, while setting up the task, when the policies required to start the task were missing, but a non-zero exit code doesn't seem to affect that metric.

Is it just that EventBridge doesn't provide metrics for a non-zero container exit code, or could we be missing something in our setup?

We can work around this by having the task log a specific error message, but the exit code would be more general.

Touko
    You might want to look at creating an Event which is triggered when the Task reaches STOPPED see https://stackoverflow.com/questions/55176187/creating-an-cloudwatch-event-rule-for-failed-ecs-tasks – Tom Harvey Dec 11 '20 at 15:03

1 Answer

I've created a CloudWatch Events rule that catches the event fired when a container stops.

This is Python CDK code, but the parameters should help guide you.

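        # Imports the snippet relies on (normally at the top of the module).
        from aws_cdk import aws_events as events
        from aws_cdk import aws_events_targets as targets

        # id_suffix, container_name and scope.result_queue come from the
        # surrounding stack code and are not shown here.
        result_rule = events.Rule(self, 'TaskCompletion%s' % id_suffix,
            event_pattern=events.EventPattern(
                source=["aws.ecs"],
                detail_type=["ECS Task State Change"],
                region=[scope.env.region],
                detail={
                    # Match only tasks that have reached STOPPED ...
                    "lastStatus": ["STOPPED"],
                    # ... and that contain a container with this name.
                    "containers": {
                        "name": [container_name]
                    }
                }
            ),
            targets=[
                # Deliver the full event to the SQS queue.
                targets.SqsQueue(
                    queue=scope.result_queue,
                )
            ]
        )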

It sends the entire event into an SQS queue (the target could also be an SNS topic); you'll probably want a Lambda to process it.
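If only failed runs are of interest, the rule itself could probably be narrowed instead of filtering later. The sketch below builds such an event pattern as a plain dictionary, using EventBridge's `anything-but` content filter on `exitCode`. The function name `failed_task_pattern` and the container name `my-container` are hypothetical. Note one assumption that is worth testing: a task whose container never started has no `exitCode` at all, so this pattern would not match it.

```python
import json


def failed_task_pattern(container_name):
    """Build an EventBridge pattern for stopped tasks with a non-zero exit code."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "lastStatus": ["STOPPED"],
            "containers": {
                "name": [container_name],
                # Content filter: any numeric exit code except 0.
                "exitCode": [{"anything-but": 0}],
            },
        },
    }


if __name__ == "__main__":
    # "my-container" is a placeholder name for illustration only.
    print(json.dumps(failed_task_pattern("my-container"), indent=2))
```

The `detail` part of this dictionary should be usable as the `detail` argument of the CDK `EventPattern` shown earlier, though I have not verified that against a live rule.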

The Lambda can then read the exit code from the event, along with:

  • Container/Task name
  • Start and Stop time
  • CPU and Memory allocation

This allows you to create a metric on the exit code (0 or non-zero) and also to report run-time metrics.
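A minimal sketch of such a Lambda follows. It assumes the function is triggered by the SQS queue, that each message body is the full "ECS Task State Change" event, and that the event's `detail` carries `containers[].exitCode`, `startedAt`, `stoppedAt`, `cpu` and `memory`. The constants `CONTAINER_NAME` and `METRIC_NAMESPACE`, the metric name `FailedRuns`, and the choice to count a missing exit code as a failure are all assumptions made for illustration.

```python
import json
from datetime import datetime

# Hypothetical values; replace with your own container name and namespace.
CONTAINER_NAME = "my-container"
METRIC_NAMESPACE = "Custom/ScheduledTasks"


def _parse_ts(value):
    # ECS timestamps are ISO 8601 with a trailing "Z"; older Python versions
    # of fromisoformat() reject "Z", so swap it for an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_task(detail, container_name):
    """Pull exit code, run time and sizing out of one event's `detail`."""
    container = next(
        (c for c in detail.get("containers", []) if c.get("name") == container_name),
        None,
    )
    exit_code = container.get("exitCode") if container else None
    started, stopped = detail.get("startedAt"), detail.get("stoppedAt")
    duration = None
    if started and stopped:
        duration = (_parse_ts(stopped) - _parse_ts(started)).total_seconds()
    return {
        "container": container_name,
        "exit_code": exit_code,
        # Assumption: no exit code (container never started) counts as failed.
        "failed": exit_code != 0,
        "duration_seconds": duration,
        "cpu": detail.get("cpu"),
        "memory": detail.get("memory"),
    }


def handler(event, context):
    """SQS-triggered entry point; each record body is one EventBridge event."""
    import boto3  # imported lazily so summarize_task() works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    for record in event["Records"]:
        detail = json.loads(record["body"])["detail"]
        summary = summarize_task(detail, CONTAINER_NAME)
        # Publish 1 for a failed run and 0 for a successful one.
        cloudwatch.put_metric_data(
            Namespace=METRIC_NAMESPACE,
            MetricData=[{
                "MetricName": "FailedRuns",
                "Dimensions": [{"Name": "Container", "Value": CONTAINER_NAME}],
                "Value": 1.0 if summary["failed"] else 0.0,
                "Unit": "Count",
            }],
        )
```

A CloudWatch alarm on the sum of `FailedRuns` could then drive the notification, and `duration_seconds` could be published the same way as a second metric.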

Tom Harvey