
Outline:

I have a very simple ECS container which listens on port 5000 and writes out HelloWorld plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load balance them, really just to learn more about how this works. It is working to a certain extent, but my health check is failing (timing out), which causes the container tasks to be bounced up and down.

Current configuration:

  • 1 VPC ( 10.0.0.0/19 )
  • 1 Internet gateway
  • 3 private subnets, one for each AZ in eu-west-1 (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)
  • 3 public subnets, one for each AZ in eu-west-1 (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
  • 3 NAT instances, one in each of the public subnets, routing 0.0.0.0/0 to the Internet gateway and each assigned an Elastic IP
  • 3 ECS instances, again one in each private subnet with a route to the NAT instance in the corresponding public subnet in the same AZ as the ECS instance
  • 1 ALB (Internet-facing) which is registered with my 3 public subnets
  • 1 Target group (with no instances registered, as per the ECS documentation) with a health check set up on the 'traffic' port at /health
  • 1 Service bringing up 3 tasks spread across AZs, using dynamic host ports (which are then mapped to port 5000 inside the Docker container)
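Under the hood, dynamic port mapping is just a container definition whose hostPort is 0. A minimal sketch of that fragment, in the shape ECS task definitions use (the name and image here are illustrative, not taken from my setup):

```python
# Sketch of the relevant piece of an ECS container definition using dynamic
# host ports: hostPort 0 asks ECS to pick a free port from the instance's
# ephemeral range, which the ALB target group then registers per task.
container_definition = {
    "name": "helloworld",          # illustrative name
    "image": "helloworld:latest",  # illustrative image tag
    "portMappings": [
        {"containerPort": 5000, "hostPort": 0, "protocol": "tcp"},
    ],
}
print(container_definition["portMappings"][0]["hostPort"])  # 0
```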

Routing

Each private subnet has a local route for 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in the public subnet in the same AZ as it.

Each public subnet has the same 10.0.0.0/19 local route and a default route for 0.0.0.0/0 to the Internet gateway.
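As a sanity check on the addressing above, all six /24 subnets do fall inside the VPC's 10.0.0.0/19, so the local route covers all inter-subnet traffic (including ALB-to-instance). A quick sketch with Python's stdlib ipaddress module:

```python
import ipaddress

# The VPC CIDR and the six subnets from the setup above.
vpc = ipaddress.ip_network("10.0.0.0/19")
subnets = [
    "10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24",     # private
    "10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24",  # public
]

# Every subnet is contained in the VPC block, so the 10.0.0.0/19 route
# handles all subnet-to-subnet traffic without touching the NAT or IGW.
contained = all(ipaddress.ip_network(s).subnet_of(vpc) for s in subnets)
print(contained)  # True
```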

Security groups

My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.

The ALB is in a security group that allows ingress on port 80 only, but egress to the security group my ECS instances are in on any port/protocol.
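In boto3 terms, the instance-side ingress rule would look roughly like this (the group ID is a placeholder, not a value from my setup); the key point is that the 32768–65535 range has to cover the ephemeral ports ECS actually assigns:

```python
# Sketch of the IpPermissions entry for the ECS instances' security group,
# in the shape boto3's authorize_security_group_ingress expects.
# The GroupId below is a placeholder.
instance_ingress = {
    "IpProtocol": "tcp",
    "FromPort": 32768,  # start of the default ephemeral/dynamic port range
    "ToPort": 65535,
    "UserIdGroupPairs": [
        {"GroupId": "sg-0alb0000000000000"},  # the ALB's security group
    ],
}
print(instance_ingress["FromPort"], instance_ingress["ToPort"])  # 32768 65535
```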

What happens

When I bring all this up, it actually works: I can take the public DNS record of the ALB, refresh, and see responses coming back to me from my container app telling me the hostname. This is exactly what I want to achieve. However, the task fails the health check and the container is drained and replaced, with another one that also fails the health check. This continues in a cycle; I have never seen a single successful health check.

What I've tried

  • Tweaked the health check intervals so that ECS requires about 5 minutes of solid failed health checks before killing the task, on the theory that the check was just being sensitive during task start-up. This still goes on to trigger the tear-down, despite my being able to view the running application in my browser throughout.
  • Confirmed the /health URL endpoint in a number of ways. I can retrieve it publicly via the ALB (as well as view the main app root URL at '/'), and curl tells me it has a proper 200 OK response (which the health check is set to look for by default). I have SSH'ed into my ECS instances and performed a curl --head {url} on '/' and '/health', and both give a 200 OK response. I've even spun up another instance in the public subnet, granted it the same access to my instances as the ALB security group, and been able to curl the health check from there.
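The curl checks above can be mimicked in-process with a stdlib stand-in for the container: a healthy target just has to answer GET /health with a 200 and complete headers. This is a sketch of what a passing endpoint looks like, not the actual app:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in for the container: 200 on '/' and '/health', 404 otherwise."""
    def do_GET(self):
        if self.path in ("/", "/health"):
            body = b"HelloWorld"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port, much like ECS dynamic port mapping does.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Equivalent of `curl http://host:port/health` from the ECS instance.
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/health").status
print(status)  # 200
server.shutdown()
```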

Summary

I can view my application, correctly load-balanced across AZs and private subnets, on both its main URL '/' and its health check URL '/health': through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. Yet the ECS service cannot complete this health check even once without timing out. What on earth could I be missing??

Neil Trodden
  • I had a similar problem and solved it by running it locally and using the docker inspect command to get health info. I used a base image that had no curl installed.. – Lodewijck Jul 23 '23 at 12:18

2 Answers


For any that follow: I managed to break the app in my container accidentally, and it started throwing a 500 error. Crucially, the health check reported this 500 error, so it was NOT a network timeout. That means that when the health check contacts the endpoint in my app, the response is not being handled properly. This appears to be a problem between Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers", and I am sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic, see what the health check is sending and what Nancy is responding, and compare that to a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to be less fussy.
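Short of tcpdump, the raw exchange can also be inspected at socket level. This sketch (against a local stub, not Nancy) shows the kind of complete status line and headers a health checker expects to see before any "awaiting headers" timeout fires:

```python
import socket
import threading

# Minimal stand-in server that answers with a complete, well-formed response.
def serve(listener):
    conn, _ = listener.accept()
    conn.recv(1024)  # read (and discard) the request
    conn.sendall(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nOK"
    )
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=serve, args=(listener,), daemon=True).start()

# Replay roughly what an HTTP/1.1 health checker sends, then dump the raw
# reply so the status line and headers can be compared byte-for-byte.
client = socket.create_connection(listener.getsockname())
client.sendall(b"GET /health HTTP/1.1\r\nHost: target\r\nConnection: close\r\n\r\n")
raw = b""
while chunk := client.recv(1024):
    raw += chunk
client.close()
print(raw.split(b"\r\n")[0].decode())  # HTTP/1.1 200 OK
```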

edit:

Simply updating all the NuGet packages that my Nancy app was using to the latest available versions made everything start working!

Neil Trodden
  • Glad you solved it. Was going to suggest that you check the app logs to confirm whether it was receiving the healthchecks. One tangential question as I stumbled across this while getting NAT working in an ALB setup: why are you using NAT instances instead of NAT gateways? – ajl Apr 19 '19 at 14:28

More questions than answers, but maybe they will take you in the right direction.

You say that you can access the container app via the ALB, but then the node fails the health check. The ALB should not be allowing connections to the node until its health check succeeds, so if you are connecting to the node via the ALB, the ALB must have tested the node and decided it was healthy. Is it a different health check that is killing the node?

Have you checked CloudTrail to see if it has any clues about what is triggering the tear-down? Is the tear-down being triggered by the ALB or the Auto Scaling group? Might the Auto Scaling group have the wrong scale-in criteria?

Good luck

Polymath
  • Thank you for your input! I will check CloudTrail and see if there is a record of the health check. As for more questions than answers, I'm just trying to work out 'where next' to troubleshoot this, so thank you! – Neil Trodden Mar 14 '17 at 10:01
  • Unfortunately, CloudTrail simply repeats the time-out message. I've even been able to curl the health check from another instance in the same private subnet, and it's a valid 200 OK response. So I can connect from the public web, my public subnet, another instance in my private subnet, and the instance itself. No idea :-| – Neil Trodden Mar 14 '17 at 22:28