We host a constellation of services entirely in AWS (no external dependencies as far as these services go). We periodically see health-check failures (502s as public services contact the internal services' ALBs), as frequently as every hour or two, yet the services themselves experience no disruption whatsoever.
I've tried all manner of health-check settings: long and short intervals and timeouts, high and low healthy/unhealthy thresholds (the counts before a target is considered healthy or unhealthy). When I've looked at the HTTP access log in the past, I believe there were no records at all for the failed requests; I'd assumed the service went down before the request completed and a record could be written. We have regular activity, but nothing that would be considered high-volume, and congestion seems unlikely: it would disrupt normal requests too, which, per the above, it doesn't.
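One thing that has helped me narrow issues like this down is asking the ALB itself why it marked a target unhealthy, rather than inferring from the backend's logs. A sketch, with a placeholder target-group ARN:

```shell
# Placeholder ARN: substitute your own target group's ARN.
# The TargetHealth.Reason / Description fields distinguish, e.g., a
# health-check timeout from an unexpected status code.
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service/abc123
```

Polling this during one of the failure windows (or graphing the `UnHealthyHostCount` CloudWatch metric per target group) should tell you whether the ALB saw a timeout, a connection failure, or a wrong status code.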
We have more than one load-balanced instance per service.
This is a long-running issue; I've periodically searched and tried whatever reasonable approaches have been suggested, but I've had no luck learning anything further.
The platform is largely uWSGI (Python) behind Nginx.
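For what it's worth, one commonly reported cause of intermittent ALB 502s is the target closing an idle keep-alive connection just before the ALB tries to reuse it; the ALB's idle timeout defaults to 60 seconds, so the backend's keep-alive timeout should exceed that. I haven't confirmed this is my situation, but ruling it out in the nginx config fronting uWSGI would look something like (illustrative values):

```nginx
# Keep idle connections open longer than the ALB idle timeout (60s by
# default), so the ALB never reuses a connection nginx already closed.
keepalive_timeout 75s;
keepalive_requests 1000;
```

nginx's default `keepalive_timeout` is 75s, so this only matters if it has been lowered, or if the ALB's idle timeout has been raised above it.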
How could I further debug this?
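In case it's useful to anyone answering: I understand ALB access logs (disabled by default) record every request the ALB handles, including ones that never reached a target, which would sidestep my empty-backend-log problem. A sketch with placeholder ARN and bucket names:

```shell
# Enable access logging on the ALB (placeholders throughout).
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/def456 \
  --attributes Key=access_logs.s3.enabled,Value=true \
               Key=access_logs.s3.bucket,Value=my-alb-logs

# After syncing the gzipped logs down from S3: field 9 of each entry is
# elb_status_code and field 10 is target_status_code; a "-" in field 10
# means no target ever produced a response for that request.
zcat *.log.gz | awk '$9 == 502 {print $2, $4, $5, $9, $10}'
```

Whether field 10 is `-` (connection-level failure) or a real status code should indicate whether the problem is in front of nginx or behind it.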