
(Sorry in advance, I am an AWS newbie.)

I am using a cloudformation stack to manage my ECS cluster.

Let's say we have an ASG with a desired capacity of 5 EC2 instances (MinSize: 1, MaxSize: 7). When I manually change the desired capacity from 5 to 2, the cluster's change set reduces the number of instances, but all surplus instances are shut down at once. This leaves no time to reschedule the containers onto the remaining instances. So, going from 5 to 2 instances, all 3 surplus instances are shut down immediately. If by bad luck all the containers of one type were on those 3 machines, no container of that type exists anymore and the service is down.

Is it possible to have a "cooldown" between each termination? Using a scaling policy obviously won't help, since we do not want to set up a metric: the available metrics do not fit my case.

Please find some logs below:

2021-01-15 15:45:52 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
2021-01-15 15:45:52 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2021-01-15 15:45:54 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 15:47:40 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 15:47:40 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Successfully terminated instance(s) [i-X] (Progress 33%).
2021-01-15 15:52:42 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 15:53:59 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 15:53:59 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Successfully terminated instance(s) [i-X] (Progress 67%).
2021-01-15 15:59:02 UTC+0100    dev-cluster UPDATE_ROLLBACK_IN_PROGRESS The following resource(s) failed to update: [autoScalingGroup].
2021-01-15 15:59:17 UTC+0100    securityGroup   UPDATE_IN_PROGRESS  -
2021-01-15 15:59:32 UTC+0100    securityGroup   UPDATE_COMPLETE -
2021-01-15 15:59:33 UTC+0100    launchConfiguration UPDATE_COMPLETE -
2021-01-15 15:59:34 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  -
2021-01-15 15:59:37 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
2021-01-15 15:59:37 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Temporarily setting autoscaling group MinSize and DesiredCapacity to 3.
2021-01-15 15:59:38 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 16:01:25 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 16:01:25 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Successfully terminated instance(s) [i-X] (Progress 50%).
2021-01-15 16:01:46 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Received SUCCESS signal with UniqueId i-X
2021-01-15 16:01:47 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Terminating instance(s) [i-X]; replacing with 1 new instance(s).
2021-01-15 16:03:34 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT5M.
2021-01-15 16:03:34 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Received SUCCESS signal with UniqueId i-X
2021-01-15 16:03:34 UTC+0100    autoScalingGroup    UPDATE_IN_PROGRESS  Successfully terminated instance(s) [i-X] (Progress 100%).
2021-01-15 16:03:37 UTC+0100    autoScalingGroup    UPDATE_COMPLETE -
2021-01-15 16:03:37 UTC+0100    dev-cluster UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS    -
2021-01-15 16:03:38 UTC+0100    launchConfiguration DELETE_IN_PROGRESS  -
2021-01-15 16:03:39 UTC+0100    dev-cluster UPDATE_ROLLBACK_COMPLETE    -
2021-01-15 16:03:39 UTC+0100    launchConfiguration DELETE_COMPLETE -

Thanks in advance for your help!

serialp
  • Does [instance scale-in protection for a group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#instance-protection) or [Enable EC2 termination protection](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/terminating-instances.html#Using_ChangingDisableAPITermination) suits your use-case? – amitd Jan 26 '21 at 14:35
  • Unfortunately no, @amitd, since reducing the desired capacity will certainly terminate some instances, so scale-in protection or EC2 termination protection won't help in that case. Thank you – serialp Jan 26 '21 at 15:12
  • AutoScaling ignores EC2 termination protection. And when CloudFormation is terminating instances it uses the terminate-instance-in-auto-scaling-group API call, which ignores scale in protection. If you manually changed the desired, then scale in protection would be honored – Shahad Jan 28 '21 at 14:48

1 Answer


To answer your direct question: there is no feature to force an ASG to remove only x instances at a time when a drop in the desired capacity happens.

If you don't already have one, you should add a lifecycle hook on the ASG that triggers a script telling ECS to drain the containers off the instances (I'm assuming from the context that you're using ECS). You would still need to manually lower the desired capacity one at a time in this case, though. https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
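The hook itself is a small CloudFormation resource. A minimal sketch, following the pattern in the linked blog post; the `DrainTopic`, `HookRole`, and `autoScalingGroup` names here are assumptions standing in for your own resources, and the SNS-triggered draining Lambda is not shown:

```yaml
DrainingLifecycleHook:
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: !Ref autoScalingGroup   # your existing ASG
    LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    # Time the hook holds the instance in Terminating:Wait while tasks drain
    HeartbeatTimeout: 900
    DefaultResult: ABANDON
    # SNS topic that invokes the draining Lambda (hypothetical names)
    NotificationTargetARN: !Ref DrainTopic
    RoleARN: !GetAtt HookRole.Arn
```

The hook pauses each terminating instance in the `Terminating:Wait` state, giving your script time to set the ECS container instance to `DRAINING` and let the service scheduler move tasks elsewhere before the instance actually goes away.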

If you're lowering the desired capacity in CloudFormation, you could attach an UpdatePolicy to the group telling CFN to do a RollingUpdate that replaces instances one at a time in batches: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
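Judging by your logs ("Terminating 3 obsolete instance(s) in batches of 1"), you already have something like this in place; a minimal sketch of such a policy on the ASG resource, with values chosen to match the log output, would be:

```yaml
autoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MaxBatchSize: 1            # replace one instance per batch
      MinInstancesInService: 1   # never drop below one in-service instance
      PauseTime: PT5M            # matches the PT5M signal timeout in your logs
      # Wait for a cfn-signal from each new instance before terminating the next
      WaitOnResourceSignals: true
  Properties:
    # ... your existing MinSize/MaxSize/DesiredCapacity etc.
```

Note that the rolling update only governs instance *replacement* during a stack update; it does not by itself slow down a pure reduction in desired capacity.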

If you are using ECS, setting up two target tracking scaling policies is usually a good idea: one for CPUReservation and one for MemoryReservation. You could also manually create step scaling policies based on these metrics if you want to force the ASG to never scale in by more than one instance at a time, but creating four CloudWatch alarms in CFN would be a pain.
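A minimal sketch of one of those two target tracking policies, using the cluster-level `CPUReservation` metric from the `AWS/ECS` namespace; the `dev-cluster` dimension value and the 75% target are assumptions you would adjust:

```yaml
CpuReservationScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref autoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      # ECS cluster reservation is a custom metric from ASG's point of view
      CustomizedMetricSpecification:
        MetricName: CPUReservation
        Namespace: AWS/ECS
        Dimensions:
          - Name: ClusterName
            Value: dev-cluster   # your ECS cluster name
        Statistic: Average
      TargetValue: 75.0          # keep cluster CPU reservation around 75%
```

A second, near-identical policy would track `MemoryReservation`. Target tracking manages the alarms for you, which is why it is less painful than hand-building step scaling alarms in CFN.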

Another option would be to use a CapacityProvider in ECS, which will enable scale-in protection on any instance with a task running on it.
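A minimal sketch of a capacity provider with managed termination protection; note that, as I understand it, `ManagedTerminationProtection: ENABLED` also requires `NewInstancesProtectedFromScaleIn: true` on the ASG itself, and the `!Ref` here assumes CFN will accept the ASG reference for the `AutoScalingGroupArn` property:

```yaml
ClusterCapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref autoScalingGroup
      # ECS protects instances running tasks from scale-in termination
      ManagedTerminationProtection: ENABLED
      ManagedScaling:
        Status: ENABLED
        TargetCapacity: 100   # aim for the ASG to be fully utilized
```

With this in place, ECS adjusts the ASG's desired capacity itself and will not terminate instances that still have tasks on them, which addresses the original "all containers on the terminated machines" failure mode.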

Shahad
  • Thanks a lot @Shahad. Indeed, I have not yet used a lifecycle hook, and I intend to use one, as I also have ASG issues when updating my instances with a new AMI version. Regarding the UpdatePolicy, I am indeed using one. How would you implement the batches in order to terminate one instance at a time? Thanks again! – serialp Feb 01 '21 at 13:49