We're running spring-cloud microservices using eureka on AWS ECS. We're also doing continuous deployment, and we've run into an issue where rolling production deployments cause a short window of service unavailability. I'm focusing here on @LoadBalanced RestTemplate clients using ribbon. I think I've gotten retry working adequately in my local testing environment, but I'm concerned about the lag between a new service instance registering with eureka and clients actually seeing it, combined with the way ECS rolling deployments work.
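For context, here's roughly the retry setup I've been testing (client-side application.properties; spring-retry is on the classpath, which I understand is required for @LoadBalanced RestTemplate retry, and the values are just what I've been experimenting with):

# retry the same instance once, then try up to two other instances
ribbon.MaxAutoRetries=1
ribbon.MaxAutoRetriesNextServer=2
# retry all HTTP methods, not just GETs
ribbon.OkToRetryOnAllOperations=true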

When we merge a new commit to master, if the build passes (compiles and tests pass), our jenkins pipeline builds and pushes a new docker image to ECR, creates a new ECS task definition revision pointing to the updated image, and updates the ECS service. As an example, we have an ECS service with a desired task count of 2, minimum healthy percent of 100%, and maximum percent of 200%. The ECS service scheduler starts 2 new docker containers from the new image, leaving the existing 2 containers running on the old image. We use container health checks that pass once the actuator health endpoint returns 200, and as soon as the new containers are healthy, the ECS service scheduler stops the 2 old containers running the old docker image.
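For concreteness, the deploy step boils down to something like this (cluster, service, and task definition names are placeholders):

#!/bin/bash
# Register the new task definition revision, then roll the service.
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs update-service \
    --cluster my-cluster \
    --service my-service \
    --task-definition my-service-taskdef \
    --desired-count 2 \
    --deployment-configuration minimumHealthyPercent=100,maximumPercent=200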

My understanding here could be incorrect, so please correct me if I'm wrong about any of this. Eureka clients fetch the registry every 30 seconds by default, so there's a window of up to 30 seconds (longer, I believe, once the eureka server's response cache and ribbon's own server list cache are factored in) during which the client's server list contains only the old service instances, and retry won't help there because every instance it can retry against is about to be stopped.
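For reference, these are the client-side intervals involved, as I understand them (shown with what I believe are the out-of-the-box defaults):

# application.properties on the consuming service
# how often the eureka client refreshes its local copy of the registry
eureka.client.registryFetchIntervalSeconds=30
# how often ribbon rebuilds its server list from that local copy (milliseconds)
ribbon.ServerListRefreshInterval=30000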

I asked AWS support how to delay ECS task termination during rolling deploys. When an ECS service is associated with an ALB target group, there's a deregistration delay setting that ECS respects, but no such option exists when no load balancer is involved. Their suggestion was to run the java application via an entrypoint bash script like this:

#!/bin/bash

# On SIGTERM (sent by docker stop when ECS terminates the task), delay
# shutdown so eureka clients have time to refresh their registry before
# this instance actually goes away.
cleanup() {
    date
    echo "Received SIGTERM, sleeping for 45 seconds"
    sleep 45
    date
    echo "Killing child process"
    # The leading '-' makes $$ a process group id, so this signals the
    # whole group, including the java child.
    kill -- -$$
}

trap 'cleanup' SIGTERM

# Run the container command (the java application) in the background so
# the shell itself stays around to receive the signal.
"${@}" &

# wait returns early when a trapped signal arrives, running cleanup.
wait $!

When ECS terminates the old tasks, it sends SIGTERM to each docker container; the script traps it, sleeps for 45 seconds, then proceeds with the shutdown. I'll also have to change an agent parameter in /etc/ecs/ecs.config that controls the grace period between the SIGTERM and the SIGKILL (ECS_CONTAINER_STOP_TIMEOUT, if I've read the docs correctly), which defaults to 30 seconds, not quite long enough here.
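Concretely, I believe that means putting this on each container instance:

# /etc/ecs/ecs.config
# grace period between docker stop's SIGTERM and the follow-up SIGKILL
# (default 30s)
ECS_CONTAINER_STOP_TIMEOUT=60s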

This feels dirty to me. I'm not confident the script won't cause some other unforeseen issue; does it forward all signals appropriately, for instance? It feels like an unwanted complication.

Am I missing something? Can anyone spot anything wrong with AWS support's suggested entrypoint script? Is there a better way to achieve the desired result, which is zero-downtime rolling deployments for services registered in eureka on ECS?
