I have an ECS service that runs an SSH process, and I deploy updates to it through CodeDeploy. I've noticed that this service is much slower to deploy than other services with identical images that are deployed at the same time via CodePipeline. The difference with this service is that it's behind an NLB (the others either have no load balancer or are behind an ALB).
The service is set to 1 container, deploying at 200%/100%, so the service brings up 1 new container, ensures it's healthy, then removes the old one. What I see happen is (a timing sketch follows the list):
- New container starts in the `Initial` state
- 3+ minutes later, the new container becomes `Healthy` and the old container enters `Draining`
- 2+ minutes later, the old container finishes `Draining` and stops
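
For reference, one way to put timestamps on those transitions is to poll target health on the NLB's target group; a minimal boto3 sketch, with the target group ARN as a placeholder:

```python
import time
from datetime import datetime, timezone

import boto3

# Placeholder ARN -- substitute the real SSH target group.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/ssh-tg/0123456789abcdef"

elbv2 = boto3.client("elbv2")
last_state = {}

# Print a timestamped line whenever a target changes state
# (initial -> healthy, healthy -> draining, etc.).
while True:
    resp = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    for desc in resp["TargetHealthDescriptions"]:
        target = f"{desc['Target']['Id']}:{desc['Target'].get('Port', '')}"
        state = desc["TargetHealth"]["State"]
        if last_state.get(target) != state:
            print(f"{datetime.now(timezone.utc).isoformat()}  {target}  {state}")
            last_state[target] = state
    time.sleep(5)
```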
A deploy thus takes 5-7 minutes, mostly waiting on health checks or draining. However, I'm pretty sure SSH starts up very quickly, and I have the following settings on the target group and task definition, which should keep things relatively quick (a sketch for dumping these settings follows the list):
- TCP health check on the correct port
- Healthy/Unhealthy threshold: 2
- Interval: 10s
- Deregistration Delay: 10s
- ECS Docker stop custom timeout: 65s
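
For completeness, these settings live in different places, so here's a minimal boto3 sketch (target group ARN and task definition name are placeholders) that dumps each one to confirm they're what I think they are:

```python
import boto3

# Placeholders -- substitute the real target group ARN and task definition.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/ssh-tg/0123456789abcdef"
TASK_DEFINITION = "ssh-service"

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Health check protocol, port, interval, and thresholds are on the target group itself.
tg = elbv2.describe_target_groups(TargetGroupArns=[TARGET_GROUP_ARN])["TargetGroups"][0]
print("health check:", tg["HealthCheckProtocol"], tg["HealthCheckPort"],
      "interval:", tg["HealthCheckIntervalSeconds"],
      "healthy/unhealthy:", tg["HealthyThresholdCount"], tg["UnhealthyThresholdCount"])

# The deregistration delay is a target group attribute.
attrs = elbv2.describe_target_group_attributes(TargetGroupArn=TARGET_GROUP_ARN)["Attributes"]
print("deregistration delay:",
      {a["Key"]: a["Value"] for a in attrs}["deregistration_delay.timeout_seconds"])

# The custom Docker stop timeout is stopTimeout on the container definition.
td = ecs.describe_task_definition(taskDefinition=TASK_DEFINITION)["taskDefinition"]
print("stopTimeout:", [c.get("stopTimeout") for c in td["containerDefinitions"]])
```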
So the minimum time from SSH being up to the old container being terminated would be:
- 2 × 10s = 20s for the TCP health check to turn `Healthy`
- 10s for the deregistration delay before Docker stop
- 65s for the Docker stop timeout
That comes to 95 seconds, which is a lot less than the observed 5-7 minutes. Other services take 1-3 minutes, and their LB/target group timings are not nearly as aggressive.
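
Spelled out as a quick back-of-the-envelope check:

```python
# Minimum expected cycle time implied by the settings above.
interval_s = 10              # health check interval
healthy_threshold = 2        # consecutive passing checks before Healthy
deregistration_delay_s = 10  # draining window before deregistration completes
docker_stop_timeout_s = 65   # ECS custom stop timeout

expected_s = healthy_threshold * interval_s + deregistration_delay_s + docker_stop_timeout_s
print(expected_s)  # 95 seconds, versus the observed 5-7 minutes
```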
Any ideas why my service behind an NLB seems slow to cycle through these lifecycle transitions?