
I have configured an ECS Cluster deployment as detailed below:

  1. Services are launched as Fargate instances
  2. Services will scale in/out based on the size of an SQS queue (see the scaling-policy sketch after this list)
  3. Each service produces a metrics time-series
  4. A Cloudwatch Agent instance is deployed to pull the metrics from the service instances every 1 minute
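Roughly, the scaling in item 2 is wired like the minimal boto3 sketch below; the cluster, service and queue names, the capacity limits, and the target value are placeholders rather than the real configuration:

```python
import boto3

# Hypothetical wiring of item 2: scale the ECS service's desired count
# against SQS queue depth. All names and numbers are placeholders.
autoscaling = boto3.client("application-autoscaling")

resource_id = "service/my-cluster/my-service"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="sqs-backlog-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Track visible messages on the queue; the target value is arbitrary here.
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Namespace": "AWS/SQS",
            "Dimensions": [{"Name": "QueueName", "Value": "my-queue"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 60,
        "ScaleOutCooldown": 60,
    },
)
```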

The problem I have is:

  1. One service instance is running
  2. Queue size increases and triggers scaling-out
  3. Another service instance is started
  4. Cloudwatch Agent is pulling metrics from both instances
  5. Queue size decreases and triggers scaling-in
  6. A service instance is retired within 30 seconds
  7. However the Cloudwatch Agent has not collected the metrics from the retired instance in time and those metrics are now lost

What techniques have people used to combat this issue?

The only solution I can think of is to add a sleep to my service so that it waits 60+ seconds when it is signalled to terminate (also extending the ecs_stop_container_timeout), giving the Cloudwatch Agent time to retrieve the final set of metrics. This may work, but it feels like a hack.
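Roughly, the sleep-on-SIGTERM idea looks like the minimal sketch below; the 15-second margin and the do_work stub are illustrative assumptions, not values from my service:

```python
import signal
import threading
import time

SCRAPE_INTERVAL_SECONDS = 60                  # CloudWatch Agent pulls every minute
GRACE_SECONDS = SCRAPE_INTERVAL_SECONDS + 15  # margin so one final scrape lands

shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM when this task is selected during scale-in.
    shutdown_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def do_work():
    # Placeholder for real work: pull an SQS message, update the
    # metrics that the CloudWatch Agent scrapes, etc.
    time.sleep(1)

def main():
    while not shutdown_requested.is_set():
        do_work()
    # Stop taking new work but keep the process (and its metrics
    # endpoint) alive long enough for one more agent scrape.
    # The container stop timeout must exceed GRACE_SECONDS, otherwise
    # ECS will SIGKILL the task before this sleep finishes.
    time.sleep(GRACE_SECONDS)

if __name__ == "__main__":
    main()
```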

Thanks

1 Answer


I would say you are on the right track, though a sleep is a bit blunt. This is a useful guide on how to exit gracefully:

https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/
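One concrete piece of that guide is raising the container stop timeout so ECS waits longer between SIGTERM and SIGKILL. A minimal boto3 sketch of just that setting (family, image and role ARN are placeholders; Fargate caps stopTimeout at 120 seconds):

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="metrics-producing-service",  # placeholder
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "service",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/service:latest",  # placeholder
            "essential": True,
            # Allow up to 90s between SIGTERM and SIGKILL so a graceful
            # shutdown (and a final metrics scrape) can complete.
            "stopTimeout": 90,
        }
    ],
)
```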

  • Yes I dislike the sleep, but having tested my theory out it seems to do the job. Additionally, if a service is slow to stop when it is being retired as part of a scaling-in action, it doesn't really matter if it takes a minute extra to completely stop. – nosajsnikta Aug 09 '21 at 21:54