I have a service which is registered with two target groups: alb and wwwalb.

The alb target group is for internal requests, and the wwwalb target group is for external requests.

When I deploy my service, it starts up as it should and starts accepting requests. Looking at the access log, I can see that both alb and wwwalb probe the service. Since the service runs in 3 availability zones, I see one request per zone from each target group, 6 in total.

 - - - [19/Jun/2022:20:45:28 +0200] "GET /api/system/status HTTP/1.1" 204 -
 - - - [19/Jun/2022:20:45:28 +0200] "GET /api/system/status HTTP/1.1" 204 -
 - - - [19/Jun/2022:20:45:28 +0200] "GET /api/system/status HTTP/1.1" 204 -
 - - - [19/Jun/2022:20:45:30 +0200] "GET /api/system/status HTTP/1.1" 204 -
 - - - [19/Jun/2022:20:45:30 +0200] "GET /api/system/status HTTP/1.1" 204 -
 - - - [19/Jun/2022:20:45:30 +0200] "GET /api/system/status HTTP/1.1" 204 -

Despite this, the service is eventually taken down because the target groups believe it is unhealthy. In fact, they never seem to consider the service healthy at all.

An API call to check on the target group tells me the following:

{
    "TargetHealthDescriptions": [
        {
            "Target": {
                "Id": "10.1.143.94",
                "Port": 8182,
                "AvailabilityZone": "eu-north-1b"
            },
            "HealthCheckPort": "8182",
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.FailedHealthChecks",
                "Description": "Health checks failed"
            }
        }
    ]
}
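
For anyone reproducing this: the output above comes from a describe-target-health call along these lines (the target group ARN below is a placeholder):

    # Placeholder ARN - substitute the real target group ARN
    aws elbv2 describe-target-health \
        --target-group-arn arn:aws:elasticloadbalancing:eu-north-1:123456789012:targetgroup/wwwalb/0123456789abcdef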

I've been looking at target group metrics and load balancer configuration for a while now, but I simply cannot find anything in the setup that could explain this behaviour. The health check settings seem fine to me as well.
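
The configured health check settings (interval, timeout, thresholds and the response code matcher) can also be read back with the CLI; again, the ARN is a placeholder:

    # Placeholder ARN - output lists HealthCheckIntervalSeconds, thresholds, Matcher, etc.
    aws elbv2 describe-target-groups \
        --target-group-arns arn:aws:elasticloadbalancing:eu-north-1:123456789012:targetgroup/alb/0123456789abcdef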

I just recently added the wwwalb target group, so I'm thinking that somehow having this service in two target groups causes this. Then again, registering a service with two target groups is supported and documented by AWS.

Is there a way to get more details from AWS about what's really causing this issue? Any way of looking into why AWS believes the service is failing?

sbrattla
  • Do you have any security groups associated with your ALB, and/or your target instances, which might be blocking egress (from the ALB) and/or ingress (to your instances) on that health check port? – Castaglia Jun 19 '22 at 22:17
  • I'm not entirely sure how the unhealthy threshold count works when an application is just deployed, but it seems that the unhealthy threshold count of 2 triggered the load balancer to take the application down. The application takes more than a minute to start, and during startup the application will be unresponsive, and will not return a "whitelisted" response code. – sbrattla Jun 20 '22 at 05:05

1 Answer


I typically set my unhealthy threshold higher than my healthy threshold: for example, 2 successful calls at a 10-second interval marks a target healthy, while 6 unsuccessful calls at a 10-second interval marks it unhealthy.
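
As a sketch, those numbers would translate into target group settings like this (the ARN is a placeholder):

    # Placeholder ARN - applies the 2-healthy / 6-unhealthy / 10s example above
    aws elbv2 modify-target-group \
        --target-group-arn arn:aws:elasticloadbalancing:eu-north-1:123456789012:targetgroup/alb/0123456789abcdef \
        --health-check-interval-seconds 10 \
        --healthy-threshold-count 2 \
        --unhealthy-threshold-count 6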

That said, it shouldn't matter and your settings should work. When a target registers, it goes through an "initial" state. During that time, AWS validates the health checks and should only move the target to a healthy state once those checks succeed.

It can take a few minutes for the registration process to complete and health checks to start.

Are you sure your application isn't replying successfully and then failing for long enough that it goes unhealthy again? Or does it really take so long to start up that it never gets out of the "initial" state?
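
If it's the latter, and if the service happens to run on ECS (an assumption, the question doesn't say), a health check grace period tells the scheduler to ignore failed load balancer health checks while a task is starting. A sketch with made-up cluster and service names:

    # Hypothetical cluster/service names; 120s should cover the startup window
    aws ecs update-service \
        --cluster my-cluster \
        --service my-service \
        --health-check-grace-period-seconds 120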

James
  • the application health check endpoint is a static endpoint which returns a 204 No Content. No moving parts involved. But the application takes about 70 seconds to start up, and it seems that AWS failed the application during that time. I have increased the thresholds, and it's running fine now... – sbrattla Jun 21 '22 at 12:11