
I’m trying to get an AWS Auto Scaling Group to replace ‘unhealthy’ instances, but I can’t get it to work.

From the console, I’ve created a Launch Configuration and, from there, an Auto Scaling Group with an Application Load Balancer. I’ve kept all target group and listener settings at their defaults. I’ve selected ‘ELB’ as an additional health check type for the Auto Scaling Group. I’ve deliberately misconfigured the Launch Configuration so that it produces ‘broken’ instances -- there is no web server listening on the port configured in the listener.

The Auto Scaling Group seems to be configured correctly and is definitely aware of the load balancer. However, it thinks the instance it has spun up is healthy.

// output of aws autoscaling describe-auto-scaling-groups:

{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "MyAutoScalingGroup",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<accountId>:autoScalingGroup:3edc728f-0831-46b9-bbcc-16691adc8f44:autoScalingGroupName/MyAutoScalingGroup",
            "LaunchConfigurationName": "MyLaunchConfiguration",
            "MinSize": 1,
            "MaxSize": 3,
            "DesiredCapacity": 1,
            "DefaultCooldown": 300,
            "AvailabilityZones": [
                "eu-west-1b",
                "eu-west-1c",
                "eu-west-1a"
            ],
            "LoadBalancerNames": [],
            "TargetGroupARNs": [
                "arn:aws:elasticloadbalancing:eu-west-1:<accountId>:targetgroup/MyAutoScalingGroup-1/1e36c863abaeb6ff"
            ],
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": 300,
            "Instances": [
                {
                    "InstanceId": "i-0b589d33100e4e515",
                    // ...
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    // ...
                }
            ],
            // ...
        }
    ]
}

The load balancer, however, is very much aware that the instance is unhealthy:

// output of aws elbv2 describe-target-health:

{
    "TargetHealthDescriptions": [
        {
            "Target": {
                "Id": "i-0b589d33100e4e515",
                "Port": 80
            },
            "HealthCheckPort": "80",
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.Timeout",
                "Description": "Request timed out"
            }
        }
    ]
}

Did I just misunderstand the documentation? If not, what else needs to be done to get the Auto Scaling Group to understand that this instance is not healthy and replace it?

To be clear, when instances are marked unhealthy manually (i.e. using aws autoscaling set-instance-health), they are replaced as expected.

ErikHeemskerk
  • You are waiting at least 5 minutes after the instance becomes unhealthy in the `ELB`, right? – Riz Feb 24 '22 at 16:23
  • @Riz Yes, even after waiting for many hours, the situation is unchanged. – ErikHeemskerk Feb 25 '22 at 08:41
  • @ErikHeemskerk, can you check `Advanced configurations` -> `Termination policies` and `Suspended processes`, and also `Instance scale-in protection`? – Riz Feb 25 '22 at 09:28
  • @Riz Instances are not protected from scale-in. Termination policies is set to ‘Default’, and Suspended processes is empty. – ErikHeemskerk Feb 27 '22 at 07:21
  • This makes me think there is a misconfiguration in `Auto Scaling groups`. Can you confirm you have the correct `Target group` in `Auto Scaling groups` -> `Details`(tab)->`Load balancing`? – Riz Feb 27 '22 at 23:29
  • @Riz Yep, that’s linking to the correct load balancer target group. – ErikHeemskerk Feb 28 '22 at 08:57
  • Does this happen with only this one ALB and ASG? What if you create everything again from scratch? – Marcin Mar 14 '22 at 03:28
  • @Marcin I've destroyed and recreated everything from scratch many times over; that does not matter. – ErikHeemskerk Mar 14 '22 at 15:30

1 Answer


Explanation

If you have deliberately misconfigured the instance from the start and the ELB health check has never passed, the Auto Scaling Group does not yet consider your ELB/target group to be up and running. See this page of the documentation:

After at least one registered instance passes the health checks, it enters the InService state.

And

If no registered instances pass the health checks (for example, due to a misconfigured health check), ... Amazon EC2 Auto Scaling doesn't terminate and replace the instances.

I configured everything from scratch and observed the same behavior you described. To verify that this is indeed the root cause, check the target group status in the ASG. It is probably in the Added state instead of InService.

[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
    "LoadBalancerTargetGroups": [
        {
            "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/asg-test-1/abc",
            "State": "Added"
        }

Resolution

To achieve the desired behavior, here is what I did:

  1. Run a simple web service on port 80. Ensure the security group allows the ELB to reach the EC2 instance on that port.
  2. Wait until the target's status in the ELB is healthy; ensure the server returns a 200. You may need to create an empty index.html just to pass the health check.
  3. Wait until the target group status has become InService in the ASG.
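Steps 1 and 2 can be sketched with a throwaway web server — a minimal, hypothetical setup assuming python3 is available on the instance; in practice you would bake a real web server into the Launch Configuration's user data. Port 8080 is used here for illustration; the listener in the question uses port 80, which requires root to bind (and the target group's health check port must match whatever you use).

```shell
# Serve a trivial index.html so the ALB health check receives an HTTP 200.
mkdir -p /tmp/www
echo 'OK' > /tmp/www/index.html
cd /tmp/www
python3 -m http.server 8080 &
SERVER_PID=$!
sleep 1
# Verify locally that the health check path returns 200
python3 -c "import urllib.request; r = urllib.request.urlopen('http://localhost:8080/index.html'); print(r.getcode(), r.read().decode().strip())"
kill "$SERVER_PID"
```

Once the target passes the health check, the ALB reports it healthy and the target group attachment can transition to InService.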

For example, for Step 3:

[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
    "LoadBalancerTargetGroups": [
        {
            "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abcdef",
            "State": "InService"
        }
    ]
}

Now that it is in service, turn off the web server and wait. Check often, though, as once the ASG detects that the instance is unhealthy, it will terminate it.

[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-auto-scaling-groups
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "test-asg",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:xxx:autoScalingGroup:abc-def-ghi:autoScalingGroupName/test-asg",
            ...
            "LoadBalancerNames": [],
            "TargetGroupARNs": [
                "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abc"
            ],
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": 300,
            "Instances": [
                {
                    "InstanceId": "i-04bed6ef3b2000326",
                    "InstanceType": "t2.micro",
                    "AvailabilityZone": "us-east-1b",
                    "LifecycleState": "Terminating",
                    "HealthStatus": "Unhealthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0452c90319362cbc5",
                        "LaunchTemplateName": "test-template",
                        "Version": "1"
                    },
             ...
        },
    ...
    ]
}
Register Sole
  • This seems to be in the right direction. My aim is to get the ASG to automatically terminate and restart instances that did not launch correctly. What you (and the docs) seem to be suggesting is that if the ASG is set to a desired size of 2 instances, and _one_ of those instances doesn’t launch correctly, the ASG _will_ terminate it. How could I test this hypothesis? – ErikHeemskerk Mar 14 '22 at 15:38
  • @ErikHeemskerk In the answer I was testing with one instance, but your statement is also correct. You can set a desired size of 2 instances. Initially, both will fail the health check because there is no web server running. Then, on one of the instances, run a web server so that it passes the health check (steps 1 and 2 in the answer). You can then use the command in step 3 to monitor until its status goes to `InService`. The ASG should then terminate the other instance. – Register Sole Mar 15 '22 at 01:17