I have configured an ECS Cluster deployment as detailed below:
- Services are launched as Fargate tasks
- Services will scale in/out based on the size of an SQS queue
- Each service produces a metrics time-series
- A CloudWatch Agent instance is deployed to pull metrics from the service instances every minute; the scaling is wired up roughly like the sketch below
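
For reference, the queue-based scaling is set up roughly along these lines (a boto3 sketch; the cluster, service, and queue names, thresholds, and the step-scaling-on-an-alarm approach are placeholders rather than my exact configuration):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Register the ECS service's DesiredCount as a scalable target
# (cluster/service names are placeholders).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Step-scaling policy that adds one task when the alarm below fires.
scale_out = autoscaling.put_scaling_policy(
    PolicyName="scale-out-on-queue-depth",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
        "Cooldown": 60,
    },
)

# Alarm on SQS queue depth that drives the scale-out policy;
# the scale-in side mirrors this with a low threshold and a -1 adjustment.
cloudwatch.put_metric_alarm(
    AlarmName="queue-depth-high",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)
```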
The problem I have is:
- One service instance is running
- Queue size increases and triggers scaling out
- Another service instance is started
- The CloudWatch Agent is pulling metrics from both instances
- Queue size decreases and triggers scaling in
- A service instance is retired within 30 seconds
- However, the CloudWatch Agent has not had a chance to collect the final metrics from the retired instance, so those metrics are lost
What techniques have people used to combat this issue?
The only solution I can think of is to add a sleep to my service so that it waits 60+ seconds when it is signalled to terminate (also extending ecs_stop_container_timeout), giving the CloudWatch Agent time to retrieve the final set of metrics; a rough sketch is below. This may work, but it feels like a hack.
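
For concreteness, here is a minimal sketch of that workaround, assuming the service exposes an HTTP metrics endpoint that the CloudWatch Agent scrapes (the port, metric name, and 90-second grace period are placeholder values, not my real configuration):

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPE_GRACE_SECONDS = 90  # longer than the agent's 1-minute collection interval
stop_requested = threading.Event()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder metrics endpoint; the real service exposes its own series.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"my_metric 1\n")

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM on scale-in; flag shutdown instead of exiting immediately.
    stop_requested.set()

def main():
    server = HTTPServer(("0.0.0.0", 8080), MetricsHandler)
    # Serve metrics from a background thread so the grace period below
    # does not block the final scrape.
    threading.Thread(target=server.serve_forever, daemon=True).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    stop_requested.wait()             # block until ECS asks the task to stop
    time.sleep(SCRAPE_GRACE_SECONDS)  # give the agent one more collection cycle
    server.shutdown()

if __name__ == "__main__":
    main()
```

As far as I can tell, on Fargate the container stopTimeout caps at 120 seconds, so this grace period can only buy roughly one extra scrape before ECS kills the task anyway.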
Thanks