
For my high-traffic containerized app running on ECS Fargate, new containers need a slow ramp-up to avoid running out of memory immediately after startup. This is especially important during the update-service operation, when all containers are replaced at the same time.

How can I get this to work with ECS Fargate and an ALB, making sure the old containers stay around until the slow_start period for the new containers is over?

This is my current Terraform setup. I enabled slow_start, but during the update-service operation the old containers are stopped too early, so the new containers receive full traffic instantly.

resource "aws_alb_target_group" "my_target_group" {
  name        = "my_service"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = data.aws_vpc.active.id
  target_type = "ip"
  slow_start  = 120

  health_check {
    enabled             = true
    port                = 8080
    path                = "/healthCheck"
    unhealthy_threshold = 2
    healthy_threshold   = 2
  }
}

resource "aws_ecs_service" "my_service" {
  name                               = "my_service"
  cluster                            = aws_ecs_cluster.my_services.id
  task_definition                    = aws_ecs_task_definition.my_services.arn
  launch_type                        = "FARGATE"
  desired_count                      = var.desired_count
  deployment_maximum_percent         = 400
  deployment_minimum_healthy_percent = 100
  enable_execute_command             = true

  wait_for_steady_state = true

  network_configuration {
    subnets         = data.aws_subnets.private.ids
    security_groups = [aws_security_group.my_service_container.id]
  }

  load_balancer {
    container_name   = "my-service"
    container_port   = 8080
    target_group_arn = aws_alb_target_group.my_target_group.arn
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes        = [desired_count]
  }
}
Bastian Voigt
  • You can try with [stopTimeout](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#ContainerDefinition-stopTimeout). – Marko E Jan 23 '23 at 12:25
  • Hmm, I think the stopTimeout is used only for containers that refuse to shutdown and need to be forcefully killed. This is not the case here, my application shuts down cleanly. – Bastian Voigt Jan 23 '23 at 12:36
  • Have you tried setting a larger value in the `deregistration_delay` option? – javierlga Jan 23 '23 at 21:05
  • The documentation says that the default deregistration_delay is 300 seconds, however my containers are stopped after ~40 seconds already, as soon as the new ones are up and running. Also my requests have a very low response time around 10-30ms, so I think that deregistration is not the main concern here. My feeling is that the ECS deployment does not know about the ALB slow_start feature, so it terminates the containers before the ramp up is finished. – Bastian Voigt Jan 24 '23 at 08:19
  • What is your desired_count? I wonder if the issue is the min% is too low, resulting in ECS deployment terminating 'old' tasks too early - this leaves only 'new' tasks so they are immediately dropped out of slow-start mode? – Fermin Jan 26 '23 at 12:06
  • Desired count is normally 4, but I always need to update it manually to 8 before deploying new releases, because otherwise I run into out of memory issues. A few minutes after deployment is finished I scale it down to 4 again. Min% is set to 100 – Bastian Voigt Jan 26 '23 at 12:29
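For reference, the two knobs suggested in the comments map to Terraform roughly like this (a sketch only; the values are illustrative assumptions, not verified fixes for this setup):

```hcl
resource "aws_alb_target_group" "my_target_group" {
  # ... existing arguments ...
  slow_start           = 120
  deregistration_delay = 300 # seconds the ALB keeps draining a deregistered target
}

# stopTimeout lives inside the container definition of the task definition:
resource "aws_ecs_task_definition" "my_services" {
  # ... existing arguments ...
  container_definitions = jsonencode([{
    name        = "my-service"
    stopTimeout = 120 # seconds between SIGTERM and SIGKILL (120 is the Fargate maximum)
  }])
}
```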

1 Answer


ECS sends SIGTERM to let a container shut down gracefully, and sends SIGKILL once 30 seconds have passed.

So you can handle the SIGTERM signal (for example, by catching it in Python) and add a delay in your code before shutting down. You then need to raise the 30-second SIGKILL grace period with stopTimeout in the ContainerDefinition so ECS does not kill the task before your delay finishes.
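A minimal sketch of this in Python, assuming a hypothetical SHUTDOWN_DELAY that you would set below the stopTimeout configured in the container definition:

```python
import signal
import time

# Illustrative value: must be smaller than the container's stopTimeout,
# otherwise ECS sends SIGKILL before the delay elapses.
SHUTDOWN_DELAY = 60

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM first; delaying here keeps the old task serving
    # while the new targets finish their slow_start ramp-up.
    print(f"SIGTERM received, delaying shutdown for {SHUTDOWN_DELAY}s")
    time.sleep(SHUTDOWN_DELAY)
    raise SystemExit(0)

# Register the handler so SIGTERM triggers the delayed shutdown.
signal.signal(signal.SIGTERM, handle_sigterm)
```

The delay runs inside the handler, so the application keeps serving requests from other threads until the delay expires and the process exits cleanly.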

harnamc