
I'm new to ECS Fargate and Terraform; I've based most of the config below on tutorials/blogs.

What I'm seeing:

  • My app doesn't start because it can't connect to RDS (per the CloudWatch logs). This is OK since I haven't configured RDS yet.
  • ECS / Fargate drains the failed task and creates new ones.

This behaviour is expected.

But I expect the deployment to fail, because none of the ECS containers ever boots successfully (the ALB health check never passes).

The config I've set up is designed to fail for the following reasons:

  • The ALB health_check is configured to match a 499 response status, which my app never returns (in fact my app doesn't even have a /health endpoint!)
  • The app quits within 10 seconds of booting without ever starting an HTTP listener

But the deployment always succeeds despite no container ever staying alive :-(

What I'm seeing is (assuming the desired app count is 3):

  • After deployment the ECS service shows "3 Pending Tasks"
  • It then shows "1 Running Task" and "2 Pending Tasks"; the running task fails and it goes back to "3 Pending Tasks"
  • Frequently it shows "2 Running Tasks", but those fail too and it drops back to pending tasks
  • After a while it briefly lists "3 Running Tasks"
  • The moment it shows "3 Running Tasks" the deployment succeeds.

When ECS lists "3 Running Tasks", none of the ALB health checks have ever succeeded; "Running" only means the container has started, not that the health check passed.
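
A quick way to confirm this (assuming the AWS CLI is configured, with the real target group ARN substituted for the placeholder) is to query the target health directly; in my case it never reports "healthy" for any target:

aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --query 'TargetHealthDescriptions[].TargetHealth.State'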

It seems ECS only considers the "Running" state for success and never the ALB health check, which runs counter to what I've been reading about how this is supposed to work.

On top of that, it starts new tasks even before the previously started ones are healthy (here too ignoring the ALB health check). I was expecting it to start one container at a time, based on the ALB health check.

There are loads of topics about ECS deployments failing due to failed ELB health checks, but I'm encountering the exact opposite and struggling to find an explanation.

Given I'm new to all this I'm assuming I've made a misconfiguration or have some misunderstanding of how it is supposed to work.

But after more than 12 hours I'm not seeing it...

Hope someone can help!

I've configured the following Terraform:

locals {
  name         = "${lower(var.project)}-${var.env}"
  service_name = "${local.name}-api"

  port = 3000
}

resource "aws_lb" "api" {
  name               = "${local.service_name}-lb"
  internal           = false
  load_balancer_type = "application"
  tags               = var.tags

  subnets = var.public_subnets

  security_groups = [
    aws_security_group.http.id,
    aws_security_group.https.id,
    aws_security_group.egress-all.id,
  ]
}

resource "aws_lb_target_group" "api" {
  name        = local.service_name
  port        = 3000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id
  tags        = var.tags

  health_check {
    enabled             = true
    healthy_threshold   = 3
    interval            = 30
    path                = "/"
    port                = "traffic-port"
    protocol            = "HTTP"
    matcher             = "499" # This is a silly response code, it never succeeds
    unhealthy_threshold = 3
  }

  # NOTE: TF is unable to destroy a target group while a listener is attached,
  # therefore create a new one before destroying the old. This also means
  # we have to let it have a random name, and then tag it with the desired name.
  lifecycle {
    create_before_destroy = true
  }

  depends_on = [aws_lb.api]
}

resource "aws_lb_listener" "api-http" {
  load_balancer_arn = aws_lb.api.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

# This is the role under which ECS will execute our task. This role becomes more important
# as we add integrations with other AWS services later on.
#
# The assume_role_policy field works with the following aws_iam_policy_document to allow
# ECS tasks to assume this role we're creating.
resource "aws_iam_role" "ecs-alb-role" {
  name               = "${local.name}-api-alb-role"
  assume_role_policy = data.aws_iam_policy_document.ecs-task-assume-role.json
  tags               = var.tags
}

data "aws_iam_policy_document" "ecs-task-assume-role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

data "aws_iam_policy" "ecs-alb-role" {
  arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Attach the above policy to the execution role.
resource "aws_iam_role_policy_attachment" "ecs-alb-role" {
  role       = aws_iam_role.ecs-alb-role.name
  policy_arn = data.aws_iam_policy.ecs-alb-role.arn
}

# Based on:
# https://section411.com/2019/07/hello-world/

resource "aws_ecs_cluster" "cluster" {
  name = "${local.name}-cluster"
  tags = var.tags
}

resource "aws_ecs_service" "ecs-api" {
  name            = local.service_name
  task_definition = aws_ecs_task_definition.ecs-api.arn
  cluster         = aws_ecs_cluster.cluster.id
  launch_type     = "FARGATE"
  desired_count   = var.desired_count
  tags            = var.tags

  network_configuration {
    assign_public_ip = false
    security_groups = [
      aws_security_group.api-ingress.id,
      aws_security_group.egress-all.id
    ]
    subnets = var.private_subnets
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = var.container_name
    container_port   = local.port
  }

  # not sure what this does, it doesn't fix the problem though regardless of true/false
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

resource "aws_cloudwatch_log_group" "ecs-api" {
  name = "/ecs/${local.service_name}"
  tags = var.tags
}

resource "aws_ecs_task_definition" "ecs-api" {
  family             = local.service_name
  execution_role_arn = aws_iam_role.ecs-alb-role.arn
  tags               = var.tags

  # These are the minimum values for Fargate containers.
  cpu                      = 256
  memory                   = 512
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  container_definitions = <<EOF
  [
    {
      "name": "${var.container_name}",
      "image": "${var.ecr_url}/${var.container_name}:latest",
      "portMappings": [
        {
          "containerPort": ${local.port}
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-region": "${var.aws_region}",
          "awslogs-group": "/ecs/${local.service_name}",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
  EOF
}


resource "aws_security_group" "http" {
  name        = "http"
  description = "HTTP traffic"
  vpc_id      = var.vpc_id
  tags        = var.tags

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "https" {
  name        = "https"
  description = "HTTPS traffic"
  vpc_id      = var.vpc_id
  tags        = var.tags

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "egress-all" {
  name        = "egress_all"
  description = "Allow all outbound traffic"
  vpc_id      = var.vpc_id
  tags        = var.tags

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "api-ingress" {
  name        = "api_ingress"
  description = "Allow ingress to API"
  vpc_id      = var.vpc_id
  tags        = var.tags

  ingress {
    from_port   = 3000
    to_port     = 3000
    protocol    = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

My github action deploy config:

# This is based on:
# - https://docs.github.com/en/actions/guides/deploying-to-amazon-elastic-container-service
# - https://particule.io/en/blog/cicd-ecr-ecs/

env:
  AWS_REGION: eu-west-1
  ECR_REPOSITORY: my-service-api
  ECS_SERVICE: my-service-dev-api
  ECS_CLUSTER: my-service-dev-cluster
  TASK_DEFINITION: arn:aws:ecs:eu-west-1:123456789:task-definition/my-service-dev-api

name: Deploy
on:
  push:
    branches:
      - main
jobs:
  build:
    name: Deploy
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      packages: write
      contents: read
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@13d241b293754004c80624b5567555c4a39ffbe3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@aaf69d68aa3fb14c1d5a6be9ac61fe15b48453a2

      - name: Build, tag, and push image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          # Build a docker container and push it to ECR so that it can be deployed to ECS.
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER

          # Tag docker container with git tag for debugging purposes
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

          # We tag with ":latest" for debugging purposes, but don't use it for deployment
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER"

      - name: Download task definition
        id: download-task
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ env.TASK_DEFINITION }} \
            --query taskDefinition > task-definition.json
          echo ${{ env.TASK_DEFINITION }}
          echo "::set-output name=revision::$(cat task-definition.json | jq .revision)"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: ${{ env.ECR_REPOSITORY }}
          image: ${{ steps.build-image.outputs.image }}

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 5

      - name: De-register previous revision
        run: |
          aws ecs deregister-task-definition \
            --task-definition ${{ env.TASK_DEFINITION }}:${{ steps.download-task.outputs.revision }}

(I've anonymized some identifiers)

These configs deploy successfully; the only problem is that the GitHub CI doesn't fail even though the ECS containers never pass the ALB health check.

Niels Krijger

1 Answer


It seems ECS only considers the "Running" state for success and never the ALB health check, which runs counter to what I've been reading about how this is supposed to work.

There's no "success" state that I'm aware of in ECS. I think you are expecting some extra deployment success criterion that doesn't really exist. There is a concept of the service reaching a "steady state", which indicates that tasks have stopped being created/terminated and the health checks are passing. That is something you can wait on via the AWS CLI, or via a Terraform ECS service deployment, but I don't see the same option in the GitHub Actions you are using.
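
For example, a minimal sketch with the AWS CLI (cluster and service names taken from the env block in your GitHub workflow; adjust as needed). The services-stable waiter polls until the service reports the desired number of running tasks with a single active deployment, and fails if that doesn't happen within the waiter's timeout:

aws ecs wait services-stable \
  --cluster my-service-dev-cluster \
  --services my-service-dev-api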

On top of that, it starts new tasks even before the previously started ones are healthy (here too ignoring the ALB health check). I was expecting it to start one container at a time, based on the ALB health check.

You aren't showing your service configuration for desired count and minimum healthy percent, so it is impossible to know exactly what is happening here. It's probably some combination of those settings, plus ECS starting new tasks as soon as the ALB reports the previous tasks as unhealthy, that is causing this behavior.
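
For illustration, a sketch of the relevant knobs on the ECS service in Terraform (the values here are only examples, not recommendations):

resource "aws_ecs_service" "ecs-api" {
  # ... existing configuration ...

  desired_count                      = 3
  deployment_minimum_healthy_percent = 100 # lower bound: running tasks may not drop below 100% of desired_count during a deploy
  deployment_maximum_percent         = 200 # upper bound: allow up to 200% of desired_count while new tasks start
}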


Any reason why you aren't using a Terraform GitHub Action to deploy the updated task definition and update the ECS service? I think a single terraform apply step would replace the last four steps in your GitHub pipeline, keep Terraform in sync with your current infrastructure state, and allow you to use the wait_for_steady_state attribute to ensure the deployment is successful before the CI pipeline exits.
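
Roughly, and assuming a recent enough AWS provider, that would look something like:

resource "aws_ecs_service" "ecs-api" {
  # ... existing configuration ...

  # Make `terraform apply` block until the service reaches a steady state
  # (and fail the apply if it never does), rather than returning as soon
  # as the service update has been submitted.
  wait_for_steady_state = true
}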

Alternatively you could try adding another GitHub action that calls the AWS CLI to wait for the ECS steady state, or possibly for the ALB to have 0 unhealthy targets.
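
As a sketch (assuming the target group ARN is made available to the workflow, e.g. as a repository secret or a Terraform output), the extra step could look something like:

      - name: Wait for ALB targets to become healthy
        run: |
          # TARGET_GROUP_ARN is assumed to be provided as a repository secret
          aws elbv2 wait target-in-service \
            --target-group-arn ${{ secrets.TARGET_GROUP_ARN }}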

Mark B
  • Thanks for the reply! I've experimented with wait_for_steady_state today in terraform, and it gave the same result as what I was seeing with the github actions: ` module.ecs_api...: Modifying... [id=arn:aws:ecs:...] module.ecs_api...: Still modifying... [id=arn:aws:ecs:...i, 10s elapsed] ... module.ecs_api.aws_ecs_service.ecs-api: Modifications complete after 46s [id=...] ` Basically the moment it hits 3 running tasks (= desired count) it will succeed no matter what, regardless of health check. I'll check if I can add more details to the post. – Niels Krijger Sep 08 '21 at 20:44
  • OK, you might have to look into waiting for the load balancer target states then. https://awscli.amazonaws.com/v2/documentation/api/latest/reference/elbv2/wait/target-in-service.html – Mark B Sep 08 '21 at 20:46
  • Thanks to your comment I've tracked it a bit further: https://github.com/hashicorp/terraform-provider-aws/issues/16012 From the behaviour I'm seeing on my end it matches that issue exactly. Didn't find a solution yet though. – Niels Krijger Sep 08 '21 at 21:23