
I'm looking for a procedure to replace a specific instance in an AWS Auto Scaling group while maintaining AZ "balance" and without reducing capacity while the new instance provisions.

Occasionally we need to terminate a specific EC2 instance in a scale group, and we have struggled to find an efficient procedure for doing so. I know that I can terminate the instance directly and it will be replaced, but that temporarily reduces the overall capacity of the group while the new instance provisions. In our case this takes tens of minutes, as we must set up and deploy our software before the ALB can send requests.

If we increase the desired_capacity by 1, we can prepare a new instance in advance, but there is no guarantee that it will be created in the same AZ as the instance we wish to terminate. In addition, if I terminate the offending instance and immediately reduce the desired_capacity, will the scale group terminate another instance?

So what is the best way to manage this procedure?

John Rotenstein
Peter McEvoy

2 Answers


You can temporarily suspend and resume specific scaling processes. With this feature you can achieve the desired result in multiple ways, two of which I've described below:

A: Use the Auto Scaling Group's rebalance feature

  1. Increase the Auto Scaling Group's desired instance count by 1 and wait for the new instance to be available
  2. Temporarily suspend the Launch scaling process (this prevents an automatic launch of a new instance during the next step)
  3. Terminate the faulty instance
  4. Decrease the Auto Scaling Group's desired instance count by 1 (the number of desired instances and the actual number of instances should now be in sync again)
  5. Resume the Launch scaling process. If the remaining instances are unbalanced, the Auto Scaling Group's AZRebalance process will pick this up and gradually rebalance across the AZs.
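
The steps of option A could be scripted roughly as follows. This is a minimal sketch, not a tested production procedure: `autoscaling` is assumed to be a boto3 Auto Scaling client, `replace_instance_option_a` and its parameters are hypothetical names, and the group's maximum size is assumed to allow one extra instance. Steps 3 and 4 are combined here into a single `terminate_instance_in_auto_scaling_group` call with `ShouldDecrementDesiredCapacity=True`, which is a substitution for the two separate steps in the list. The wait only checks the `InService` lifecycle state and assumes all existing instances were already in service; if software deployment takes longer than that, a stronger readiness check would be needed.

```python
import time


def _get_group(autoscaling, group_name):
    """Return the description of a single Auto Scaling group."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    return response["AutoScalingGroups"][0]


def replace_instance_option_a(autoscaling, group_name, faulty_id,
                              poll_seconds=30, timeout_seconds=3600):
    """Scale out by one, then remove the faulty instance (option A)."""
    original = _get_group(autoscaling, group_name)["DesiredCapacity"]

    # Step 1: raise desired capacity and wait for the extra instance.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=original + 1
    )
    deadline = time.monotonic() + timeout_seconds
    while True:
        instances = _get_group(autoscaling, group_name)["Instances"]
        in_service = [i for i in instances
                      if i["LifecycleState"] == "InService"]
        if len(in_service) > original:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError("replacement did not reach InService in time")
        time.sleep(poll_seconds)

    # Step 2: suspend Launch so no further instance is started.
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name, ScalingProcesses=["Launch"]
    )
    try:
        # Steps 3 and 4 combined (assumption: one call instead of two).
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=faulty_id, ShouldDecrementDesiredCapacity=True
        )
    finally:
        # Step 5: always resume Launch, even if termination failed.
        autoscaling.resume_processes(
            AutoScalingGroupName=group_name, ScalingProcesses=["Launch"]
        )
```

With boto3 installed, the client would typically come from `boto3.client("autoscaling")`. The `try`/`finally` is there so that the Launch process is not left suspended if the termination call raises.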

B: Explicitly start a new instance in the desired AZ:

  1. Start a separate instance in the desired AZ
  2. Temporarily suspend the Terminate scaling process (this prevents an automatic termination of the additional instance during the next step)
  3. Attach the instance from (1.) to the Auto Scaling Group
  4. Terminate the original instance (the number of desired instances and the actual number of instances should now be in sync again)
  5. Resume the Terminate scaling process
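
Steps 2 to 5 of option B might look like the sketch below. It assumes the replacement instance from step 1 is already running in the desired AZ, that `autoscaling` and `ec2` are boto3 clients for Auto Scaling and EC2, and that the group's maximum size allows one extra instance; the function and parameter names are hypothetical. One detail worth knowing: attaching an instance raises the group's desired capacity by one, so this sketch removes the original instance with `detach_instances` and `ShouldDecrementDesiredCapacity=True` before terminating it through EC2. That is a substitution for the plain "terminate" in step 4, chosen so the group does not launch yet another replacement.

```python
def replace_instance_option_b(autoscaling, ec2, group_name,
                              faulty_id, replacement_id):
    """Swap a faulty instance for a pre-launched one (option B)."""
    # Step 2: suspend Terminate while the group holds one extra instance.
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name, ScalingProcesses=["Terminate"]
    )
    try:
        # Step 3: attach the replacement (desired capacity goes up by one).
        autoscaling.attach_instances(
            InstanceIds=[replacement_id], AutoScalingGroupName=group_name
        )
        # NOTE: a real procedure would wait here until the replacement
        # is InService and passing load balancer health checks.

        # Step 4 (assumption): detach with a capacity decrement, then
        # terminate the detached instance directly through EC2.
        autoscaling.detach_instances(
            InstanceIds=[faulty_id],
            AutoScalingGroupName=group_name,
            ShouldDecrementDesiredCapacity=True,
        )
        ec2.terminate_instances(InstanceIds=[faulty_id])
    finally:
        # Step 5: always resume Terminate.
        autoscaling.resume_processes(
            AutoScalingGroupName=group_name, ScalingProcesses=["Terminate"]
        )
```

After the swap, the desired capacity is back at its original value and the replacement sits in the AZ that was chosen when it was launched.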
Dennis Traub
  • I've updated the answer because there was a confusion of launch and terminate in both variants. – Dennis Traub May 15 '20 at 17:43
  • Option B sounds like the most workable, simply because I know exactly where the instance is created. I wonder, do you have a reference on the "AZRebalance process"? It sounds like something that would migrate my VM from one AZ to another. Surely that's not desirable, as that type of operation must suspend the VM to move it? – Peter McEvoy May 18 '20 at 09:16
  • When rebalancing, Amazon EC2 Auto Scaling launches new instances before terminating the old ones, so that rebalancing does not compromise the performance or availability of your application. https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-benefits.html#AutoScalingBehavior.InstanceUsage – Dennis Traub May 18 '20 at 09:20
  • Just read about `AZRebalance` and it's not migration that occurs - it's kill-n-create: "When rebalancing, Amazon EC2 Auto Scaling launches new instances before terminating the old ones, so that rebalancing does not compromise the performance or availability of your application." - unfortunately, given the time it takes to deploy our software, we would not want to be exposed while this occurs – Peter McEvoy May 18 '20 at 09:24
  • Option A has fewer moving parts for you to manage explicitly; with option B, you have complete control over the whole process. So it's completely up to what you feel more comfortable with. – Dennis Traub May 18 '20 at 09:26
  • Regarding your second comment: Yes. VMs can't be moved across AZs; they are recreated. If you have a long startup time, option B can certainly be preferable. – Dennis Traub May 18 '20 at 09:28

Auto Scaling provides the ability to:

  • Attach a specific instance to the Auto Scaling group (which was created outside of Auto Scaling)
  • Detach a specific instance from the Auto Scaling group
  • Terminate a specific instance in an Auto Scaling group
  • Temporarily place an instance in an Auto Scaling group into a standby state

When detaching, terminating or placing in standby, the Desired Capacity of the Auto Scaling group can be automatically decremented so no replacement instance is launched, or it can be kept the same so that a replacement instance is launched.
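
As a rough illustration of that choice, the three removal operations each accept a `ShouldDecrementDesiredCapacity` flag in the boto3 Auto Scaling client. The wrapper below is a hypothetical helper, not part of any library; `autoscaling` is assumed to be a boto3 Auto Scaling client.

```python
def remove_from_group(autoscaling, group_name, instance_id,
                      action, launch_replacement):
    """Remove an instance, choosing whether a replacement is launched.

    action is one of "terminate", "detach" or "standby" (hypothetical
    labels used only by this helper).
    """
    # Decrementing desired capacity means no replacement is launched.
    decrement = not launch_replacement

    if action == "terminate":
        return autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=decrement,
        )
    if action == "detach":
        return autoscaling.detach_instances(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
            ShouldDecrementDesiredCapacity=decrement,
        )
    if action == "standby":
        return autoscaling.enter_standby(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
            ShouldDecrementDesiredCapacity=decrement,
        )
    raise ValueError(f"unknown action: {action!r}")
```

For example, `remove_from_group(client, "my-group", "i-0abc", "terminate", launch_replacement=False)` would terminate the instance and shrink the group by one, where the group name and instance ID are placeholders.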

It would generally be a good idea to have Auto Scaling launch any new instances, so that all instances are identical. Thus, if you are concerned about a capacity drop, then you should increment the Desired Capacity to launch a new instance, then terminate the unwanted instance from the Auto Scaling group with a capacity decrease to return the group to the previous Desired Capacity.

You are correct that the instance launched will not be guaranteed to be in the same AZ as the one being removed. Auto Scaling aims to balance AZs. It will launch an instance in an AZ that has the lowest number of instances. Let's say there are two AZs that have an equal number of instances and you wish to remove an instance from AZ A. Incrementing the Desired Capacity might launch an instance in AZ B. Once the unwanted instance has been removed, this would mean that AZ B has two instances more than AZ A. Whether this is a problem depends upon the total number of instances in the Auto Scaling group.
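
The arithmetic in that example can be checked with a few lines of Python. This is only a toy model of instance counts per AZ; the AZ labels `"a"` and `"b"` and the starting count of three per AZ are made up for illustration.

```python
def spread_after_replacement(counts, launched_in, removed_from):
    """Return the per-AZ counts and the max-min spread after a swap."""
    after = dict(counts)
    after[launched_in] += 1    # the extra instance Auto Scaling launched
    after[removed_from] -= 1   # the unwanted instance that was removed
    spread = max(after.values()) - min(after.values())
    return after, spread


balanced = {"a": 3, "b": 3}

# Replacement lands in the other AZ: B ends up two ahead of A.
print(spread_after_replacement(balanced, launched_in="b", removed_from="a"))
# ({'a': 2, 'b': 4}, 2)

# Replacement lands in the same AZ: the group stays balanced.
print(spread_after_replacement(balanced, launched_in="a", removed_from="a"))
# ({'a': 3, 'b': 3}, 0)
```

A spread of two matters much more in a group of six instances than in a group of sixty, which is the point about total group size.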

The recommendation to use multiple AZs is to handle situations where an AZ might fail. Such a failure would result in a temporary loss of instances while Auto Scaling launches new instances in the remaining AZs. If such a drop is a concern, it is recommended to run extra instances to handle the temporary capacity drop. Thus, returning to your question, your Auto Scaling group should have sufficient capacity to handle one instance being removed and replaced. If a temporary drop in capacity is going to impact your system, then it would be a good idea to have extra instances launched, on the assumption that instances can and will fail occasionally. This will also help in the rare situation in which an AZ fails, since having extra capacity would mean that the system does not immediately lose 50% of its required minimum capacity.

Bottom line: Have sufficient capacity so that temporarily replacing a bad instance should not have a significant impact on the system. The concern about having an unbalanced AZ will be minor (max 2 instances different between AZs) compared to the impact of losing 50% of capacity in an AZ outage if only minimal capacity is being continually deployed.

At the end of the day, it really comes down to cost vs risk. Using more than 2 AZs can reduce the impact of AZ outages.
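
To make the cost vs risk trade-off concrete, here is a small back-of-the-envelope calculation. It assumes instances are spread evenly across AZs and that the goal is to keep a required number of instances running after losing one whole AZ; both the goal and the figure of 12 required instances are illustrative assumptions, not recommendations.

```python
import math


def instances_to_survive_az_loss(required, az_count):
    """Total instances needed so that losing one AZ leaves `required`."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ loss")
    # The surviving AZs must carry the full required load between them.
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


for az_count in (2, 3, 4):
    print(az_count, instances_to_survive_az_loss(12, az_count))
# 2 24
# 3 18
# 4 16
```

Under these assumptions, two AZs need double the required capacity, while three or four AZs need noticeably less headroom for the same protection.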

John Rotenstein
  • I think this is the money quote right here: "Have sufficient capacity so that temporarily replacing a bad instance should not have a significant impact on the system". @dennis-traub answered my question technically, but I think you have answered it philosophically and I will have to go back to my base assumptions... – Peter McEvoy May 18 '20 at 09:29
  • Yes, I totally agree. And, to be honest, if I had to decide between John's and my answer, I'd definitely take his over mine because it is not only technically correct, but also conveys the reasoning behind a well-architected application. The cloud provides many new options that on-premises we never had and this allows us to reinvent and rethink how we manage our applications. – Dennis Traub May 18 '20 at 09:36