I have an ECS cluster backed by EC2 instances in an Auto Scaling group (ASG).
The cluster uses a capacity provider defined in CloudFormation as follows:
CapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Condition: EnableInstanceAutoScaling
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref InstanceAutoScalingGroup
      ManagedScaling:
        MaximumScalingStepSize: 10
        MinimumScalingStepSize: 1
        Status: ENABLED
        TargetCapacity: 100
      ManagedTerminationProtection: ENABLED
Notice that both ManagedScaling and ManagedTerminationProtection are ENABLED.
Following the documentation quoted below, I also set NewInstancesProtectedFromScaleIn to true on the Auto Scaling group:
If managed termination protection is enabled when you create a capacity provider, the Auto Scaling group and each Amazon EC2 instance in the Auto Scaling group must have instance protection from scale in enabled as well.
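A minimal sketch of how that setting sits on the Auto Scaling group resource. Only NewInstancesProtectedFromScaleIn and the logical ID InstanceAutoScalingGroup come from the setup described here; the launch template (InstanceLaunchTemplate), the subnet parameter (SubnetIds) and the size values are hypothetical placeholders.

```yaml
InstanceAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    # Required by the capacity provider when ManagedTerminationProtection is ENABLED
    NewInstancesProtectedFromScaleIn: true
    # Everything below is a placeholder (assumption), not taken from the real template
    MinSize: "0"
    MaxSize: "10"
    VPCZoneIdentifier: !Ref SubnetIds
    LaunchTemplate:
      LaunchTemplateId: !Ref InstanceLaunchTemplate
      Version: !GetAtt InstanceLaunchTemplate.LatestVersionNumber
```

Note that this property only applies to instances launched after it is set; instances that are already running keep whatever protection state they had.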
It all works fine most of the time, but sometimes EC2 instances get stuck in the ASG:
- they are deregistered from the ECS cluster (i.e. no longer listed there);
- they still have scale-in protection enabled;
- the ASG cannot terminate them.
It doesn't happen to all instances, only to some, and I can't tell which ones will be affected. I don't have any lifecycle hooks. As a result, the ASG fills up with unused instances (which cost money) until it can no longer scale out because it has reached its maximum capacity.
I also found a post about a similar problem with AWS Batch, where the suggested answer was to disable the ASG scale-in protection.
Any suggestions on how I can diagnose/fix the problem?
P.S. While this is happening, the ASG has its desired capacity set to e.g. 1 and is actively trying to scale in.