I have configured an ECS Cluster deployment as detailed below:
- Services are launched as Fargate tasks
- Services will scale in/out based on the size of an SQS queue
- Each service produces a metrics time-series
- A CloudWatch Agent instance is deployed to pull metrics from the service instances every minute; the scaling is wired up roughly like the sketch below
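
For reference, the queue-based scaling is set up roughly along these lines (a boto3 sketch; the cluster, service, and queue names, thresholds, and the step-scaling-on-an-alarm approach are placeholders rather than my exact configuration):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Register the ECS service's DesiredCount as a scalable target
# (cluster/service names are placeholders).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Step-scaling policy that adds one task when the alarm below fires.
scale_out = autoscaling.put_scaling_policy(
    PolicyName="scale-out-on-queue-depth",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
        "Cooldown": 60,
    },
)

# Alarm on SQS queue depth that drives the scale-out policy;
# the scale-in side mirrors this with a low threshold and a -1 adjustment.
cloudwatch.put_metric_alarm(
    AlarmName="queue-depth-high",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)
```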
The problem I have is:
- One service instance is running
- Queue size increases and triggers scaling out
- Another service instance is started
- The CloudWatch Agent is pulling metrics from both instances
- Queue size decreases and triggers scaling in
- A service instance is retired within 30 seconds
- However, the CloudWatch Agent has not had a chance to collect the final metrics from the retired instance, so those metrics are lost
What techniques have people used to combat this issue?
The only solution I can think of is to add a sleep to my service so that it waits 60+ seconds when it is signalled to terminate (also extending ecs_stop_container_timeout), giving the CloudWatch Agent time to retrieve the final set of metrics; a rough sketch is below. This may work, but it feels like a hack.
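
For concreteness, here is a minimal sketch of that workaround, assuming the service exposes an HTTP metrics endpoint that the CloudWatch Agent scrapes (the port, metric name, and 90-second grace period are placeholder values, not my real configuration):

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPE_GRACE_SECONDS = 90  # longer than the agent's 1-minute collection interval
stop_requested = threading.Event()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder metrics endpoint; the real service exposes its own series.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"my_metric 1\n")

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM on scale-in; flag shutdown instead of exiting immediately.
    stop_requested.set()

def main():
    server = HTTPServer(("0.0.0.0", 8080), MetricsHandler)
    # Serve metrics from a background thread so the grace period below
    # does not block the final scrape.
    threading.Thread(target=server.serve_forever, daemon=True).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    stop_requested.wait()             # block until ECS asks the task to stop
    time.sleep(SCRAPE_GRACE_SECONDS)  # give the agent one more collection cycle
    server.shutdown()

if __name__ == "__main__":
    main()
```

As far as I can tell, on Fargate the container stopTimeout caps at 120 seconds, so this grace period can only buy roughly one extra scrape before ECS kills the task anyway.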
Thanks