
I have an ECS service that serves an SSH process. I am deploying updates to this service through CodeDeploy. I noticed that this service is much slower to deploy than other services with identical images deployed at the same time using CodePipeline. The difference with this service is that it's behind an NLB (the others have no LB or are behind an ALB).

The service is set to 1 container, deploying at 200%/100% (maximum percent / minimum healthy percent), so the service brings up 1 new container, ensures it's healthy, then removes the old one. What I see happen is:

  1. New Container started in Initial state
  2. 3+ minutes later, New Container becomes Healthy. Old Container enters Draining
  3. 2+ minutes later, Old Container finishes Draining and stops

Deploying thus takes 5-7 minutes, mostly waiting for health checks or draining. However, I'm pretty sure SSH starts up very quickly, and I have the following settings on the target group, which should make things relatively quick (a rough sketch of how these could be set is shown after the list):

  • TCP health check on the correct port
  • Healthy/Unhealthy threshold: 2
  • Interval: 10s
  • Deregistration Delay: 10s
  • ECS Docker stop custom timeout: 65s
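
For illustration, this is roughly how those settings could be expressed with the AWS SDK for JavaScript v3; the target group name, port, VPC ID, and region below are placeholders, not my actual values:

    // Rough sketch only -- all names/IDs are placeholders, not my real values.
    const {
      ElasticLoadBalancingV2Client,
      CreateTargetGroupCommand,
      ModifyTargetGroupAttributesCommand,
    } = require("@aws-sdk/client-elastic-load-balancing-v2");

    const elbv2 = new ElasticLoadBalancingV2Client({ region: "us-east-1" });

    async function configureTargetGroup() {
      // TCP health check on the service port, thresholds of 2, 10s interval
      const { TargetGroups } = await elbv2.send(new CreateTargetGroupCommand({
        Name: "ssh-service-tg",                 // placeholder
        Protocol: "TCP",
        Port: 22,                               // placeholder port
        VpcId: "vpc-0123456789abcdef0",         // placeholder
        TargetType: "ip",                       // awsvpc tasks register by IP
        HealthCheckProtocol: "TCP",
        HealthCheckIntervalSeconds: 10,
        HealthyThresholdCount: 2,
        UnhealthyThresholdCount: 2,
      }));

      // 10s deregistration delay so draining targets are released quickly
      await elbv2.send(new ModifyTargetGroupAttributesCommand({
        TargetGroupArn: TargetGroups[0].TargetGroupArn,
        Attributes: [{ Key: "deregistration_delay.timeout_seconds", Value: "10" }],
      }));
    }

    // The 65s Docker stop timeout is separate: it lives in the ECS container
    // definition as "stopTimeout": 65, not on the target group.

    configureTargetGroup().catch(console.error);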

So the minimum time from SSH being up to the old container being terminated would be:

  • 2*10=20s for TCP health check to turn to Healthy
  • 10s for the deregistration delay before Docker stop
  • 65s for the Docker stop timeout

This is 95 seconds, which is a lot less than the observed 5-7 minutes. Other services take 1-3 minutes, and the LB/Target Group timings are not nearly as aggressive there.
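
For clarity, the arithmetic behind that figure, using the values listed above:

    // Back-of-the-envelope minimum, from the settings listed above
    const healthyThreshold = 2;            // consecutive passing TCP checks
    const healthCheckIntervalSeconds = 10; // target group check interval
    const deregistrationDelaySeconds = 10; // drain time before Docker stop
    const stopTimeoutSeconds = 65;         // ECS Docker stop custom timeout

    const minimumSeconds =
      healthyThreshold * healthCheckIntervalSeconds + // new container -> Healthy
      deregistrationDelaySeconds +                    // old container drains
      stopTimeoutSeconds;                             // old container stops

    console.log(minimumSeconds); // 95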

Any ideas why my service behind an NLB seems slow to cycle through these lifecycle transitions?

Eric M. Johnson

1 Answer


You are not doing anything wrong here; this simply appears to be a (current) limitation of this product.

I recently noticed similar delays in registration/availability time with ECS services behind an NLB and decided to explore. I created a simple JavaScript TCP echo server and set it up as an ECS service behind an NLB (ECS service count of 1). Like you, I used a TCP health check with a healthy/unhealthy threshold of 2 and an interval/deregistration delay of 10 seconds.
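
A minimal sketch of such an echo server looks roughly like this; the port is arbitrary, and the per-connection logging is just one way to see when the NLB's health check traffic starts arriving (in my tests I actually measured that with tcpdump on the host, as noted below):

    // Minimal Node.js TCP echo server sketch -- port is arbitrary
    const net = require("net");

    const PORT = process.env.PORT || 8080; // must match the target group port

    const server = net.createServer((socket) => {
      // NLB TCP health checks show up as short-lived connections from the
      // load balancer nodes, so logging arrival times makes them visible.
      console.log(`${new Date().toISOString()} connection from ${socket.remoteAddress}`);
      socket.pipe(socket); // echo everything back to the client
      socket.on("error", () => socket.destroy());
    });

    server.listen(PORT, () => console.log(`echo server listening on ${PORT}`));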

After the initial deploy was successful and the service was reachable via the NLB, I wanted to see how long it would take for the service to be restored in the event of a complete failure of the underlying instance. To simulate this, I killed the service via the ECS console. After several iterations of this test, I consistently observed a timeline similar to the following (times are in seconds):

0s:   killed service
5s:   ECS reports old service draining
      Target Group shows service draining
      ECS reports new service instance is started
15s:  ECS reports new task is registered
      Target Group shows new instance with status of 'initial'
135s: TCP healthcheck traffic from the load balancer starts arriving 
      for the service (as measured by tcpdump on the EC2 host running 
      the container)
225s: Target Group finally marks the service as 'healthy'
      ECS reports service has reached a steady state

I performed the same tests with a simple express app behind an ALB, and the gap between ECS starting the service and the ALB reporting it healthy was 10-15 seconds. The best result we achieved testing the NLB was 3.5 minutes from service stop to full availability.

I shared these findings with AWS via a support case, asking specifically for clarification on why there was a consistent 120-second gap before the NLB started health-checking the service, and why we consistently saw 90-120 seconds between the beginning of health checks and service availability. They confirmed that this behavior is known, but did not offer a timeline for resolution or a strategy to decrease the latency in service availability.

Unfortunately, this will not do much to help resolve your issue, but at least you can know that you're not doing anything wrong.

bjcube
  • Thanks for verifying I'm not missing anything. It's disappointing that it takes this long, but at least it's confirmed. – Eric M. Johnson Feb 20 '19 at 16:24
  • I'm seeing the same issue, with replacement tasks sitting idle for 120 seconds in target group state "initial" before the first health check comes in. Apart from this SO question, there seems to be zero information about this online. Have you @bjcube or eric-m-johnson heard any updates about it from AWS, or found any way to alleviate it? – svenx Sep 06 '19 at 11:25
  • 1
    AWS continues to provide no ETA on a solution, nor have they offered recommendations to work around the issue. I have found no solutions here, aside from minimizing use of NLBs. – bjcube Sep 09 '19 at 13:06