In my AWS account I have an ASG set up for my SQS consumer. It has a min capacity of 3 and a max capacity of 8, and the termination policy is set to "Default". It has two simple scaling policies attached to a CloudWatch alarm that monitors the size of the SQS queue.
Here is the threshold for the CloudWatch alarm: ApproximateNumberOfMessagesVisible >= 10 for 1 consecutive period of 300 seconds, on the queue's metric dimensions.
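For reference, the alarm is roughly equivalent to this boto3 sketch (the alarm name, queue name, and statistic are placeholders, not my exact configuration):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Rough equivalent of the alarm described above; the alarm name, queue name,
# and statistic are placeholders rather than the real configuration.
cloudwatch.put_metric_alarm(
    AlarmName="sqs-consumer-queue-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-consumer-queue"}],
    Statistic="Maximum",             # assumed statistic
    Period=300,                      # one 300-second period
    EvaluationPeriods=1,             # 1 consecutive period
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
)
```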
When the CloudWatch alarm goes into the ALARM state after 300 seconds, the ASG adds 1 instance, up to the max capacity. Likewise, when the alarm returns to the OK state after 300 seconds, the ASG removes 1 instance, down to the min capacity.
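The two simple scaling policies are wired to that single alarm roughly like this (again a sketch with placeholder names; the scale-out policy ARN is attached to the alarm's AlarmActions and the scale-in policy ARN to its OKActions):

```python
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "sqs-consumer-asg"  # placeholder name

# Scale out by one instance each time the alarm goes into ALARM.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="queue-depth-scale-out",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Scale in by one instance each time the alarm returns to OK.
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="queue-depth-scale-in",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=300,
)

# scale_out["PolicyARN"] is registered as an AlarmAction on the alarm above,
# and scale_in["PolicyARN"] as an OKAction.
```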
The ASG scales up to max capacity with no issues. The problem occurs when it scales back down. When the alarm state goes from ALARM back to OK, the ASG seems to pick an instance to shut down at random. This is a problem if the instance it shuts down is currently processing an SQS message.
For example, if my SQS queue has 20 visible messages, my ASG will scale up, let's say to 8 instances. Once the visible message count drops back below the alarm threshold, the ASG starts terminating instances. But it might pick an instance that is still processing an SQS message, and if it does, that message ends up in my DLQ.
Has anyone run into this issue before?
Is there a way to configure the ASG to monitor the SQS queue length and only terminate instances that have finished processing their messages? Maybe when the alarm is OK and the instance has low CPU? Or should I be setting the threshold in my CloudWatch alarm to something like 2?