I'm new to ECS Fargate and Terraform; I've based most of the config below on tutorials/blogs.
What I'm seeing:
- My app doesn't start because it can't connect to RDS (according to the CloudWatch logs). This is OK, since I haven't configured RDS yet.
- ECS / Fargate drains the failed task and creates a new one.
This behaviour is expected.
But I expect the deployment to fail, because none of the ECS containers ever boots successfully (the ALB health check never passes).
The config I've set up is designed to fail for the following reasons:
- The ALB health_check is configured to match a 499 response status (which my app never returns; in fact, my app doesn't even have a /health endpoint!)
- The app doesn't start at all and quits within 10 seconds of booting, without ever starting an HTTP listener
But the deployment always succeeds despite no container ever being alive :-(
What I'm seeing is (assuming the desired app count is 3):
- After deployment the ECS task gets "3 Pending Tasks"
- It will start with "1 Running Task" and "2 Pending Tasks"; the running one fails and it goes back to "3 Pending Tasks"
- Frequently it shows "2 Running Tasks", but those fail too and it goes back to "Pending Tasks"
- After a while it will briefly list "3 Running Tasks"
- The moment it shows "3 Running Tasks" the deployment succeeds.
When ECS lists "3 Running Tasks", none of the ALB health checks have ever succeeded; "Running" only means the container was started, not that its health check passed.
It seems ECS only considers the "Running" state for success and never the ALB health check, which runs counter to everything I've read about how this is supposed to work.
On top of that, it starts new tasks before the previously started one is healthy (again ignoring the ALB health check). I was expecting it to bring up one container at a time, gated on the ALB health check.
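For context, these are the aws_ecs_service arguments I understood (possibly wrongly) to tie a rolling deployment to task health. None of them are set explicitly in my config below, so this is only a sketch; the "example" label is a placeholder and the values are what I believe the defaults to be:

resource "aws_ecs_service" "example" {
  # ... same service as below, shown here only to illustrate the arguments ...

  # How long ECS ignores failing ALB health checks after a task starts.
  health_check_grace_period_seconds = 0

  # Rolling-update limits; I assumed these would make ECS replace tasks in small
  # batches and only count a new task once the ALB reports it healthy.
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
}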
There are loads of topics about ECS deployments failing due to failed ELB health checks, but I'm seeing the exact opposite and struggling to find an explanation.
Given I'm new to all this, I assume I've misconfigured something or misunderstood how this is supposed to work.
But after more than 12 hours I'm not seeing it...
Hope someone can help!
I've configured the following Terraform:
locals {
  name = "${lower(var.project)}-${var.env}"
  service_name = "${local.name}-api"
  port = 3000
}

resource "aws_lb" "api" {
  name = "${local.service_name}-lb"
  internal = false
  load_balancer_type = "application"
  tags = var.tags
  subnets = var.public_subnets
  security_groups = [
    aws_security_group.http.id,
    aws_security_group.https.id,
    aws_security_group.egress-all.id,
  ]
}

resource "aws_lb_target_group" "api" {
  name = local.service_name
  port = 3000
  protocol = "HTTP"
  target_type = "ip"
  vpc_id = var.vpc_id
  tags = var.tags

  health_check {
    enabled = true
    healthy_threshold = 3
    interval = 30
    path = "/"
    port = "traffic-port"
    protocol = "HTTP"
    matcher = "499" # This is a silly response code; it never succeeds
    unhealthy_threshold = 3
  }

  # NOTE: TF is unable to destroy a target group while a listener is attached,
  # therefore create a new one before destroying the old. This also means
  # we have to let it have a random name, and then tag it with the desired name.
  lifecycle {
    create_before_destroy = true
  }

  depends_on = [aws_lb.api]
}

resource "aws_lb_listener" "api-http" {
  load_balancer_arn = aws_lb.api.arn
  port = "80"
  protocol = "HTTP"

  default_action {
    type = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

# This is the role under which ECS will execute our task. This role becomes more important
# as we add integrations with other AWS services later on.
#
# The assume_role_policy field works with the following aws_iam_policy_document to allow
# ECS tasks to assume this role we're creating.
resource "aws_iam_role" "ecs-alb-role" {
  name = "${local.name}-api-alb-role"
  assume_role_policy = data.aws_iam_policy_document.ecs-task-assume-role.json
  tags = var.tags
}

data "aws_iam_policy_document" "ecs-task-assume-role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

data "aws_iam_policy" "ecs-alb-role" {
  arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Attach the above policy to the execution role.
resource "aws_iam_role_policy_attachment" "ecs-alb-role" {
  role = aws_iam_role.ecs-alb-role.name
  policy_arn = data.aws_iam_policy.ecs-alb-role.arn
}

# Based on:
# https://section411.com/2019/07/hello-world/
resource "aws_ecs_cluster" "cluster" {
  name = "${local.name}-cluster"
  tags = var.tags
}

resource "aws_ecs_service" "ecs-api" {
  name = local.service_name
  task_definition = aws_ecs_task_definition.ecs-api.arn
  cluster = aws_ecs_cluster.cluster.id
  launch_type = "FARGATE"
  desired_count = var.desired_count
  tags = var.tags

  network_configuration {
    assign_public_ip = false
    security_groups = [
      aws_security_group.api-ingress.id,
      aws_security_group.egress-all.id
    ]
    subnets = var.private_subnets
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name = var.container_name
    container_port = local.port
  }

  # Not sure what this does; it doesn't fix the problem regardless of true/false
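  # (as far as I understand, it's meant to detect a deployment whose tasks keep failing and roll it back)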
  deployment_circuit_breaker {
    enable = true
    rollback = true
  }
}

resource "aws_cloudwatch_log_group" "ecs-api" {
  name = "/ecs/${local.service_name}"
  tags = var.tags
}

resource "aws_ecs_task_definition" "ecs-api" {
  family = local.service_name
  execution_role_arn = aws_iam_role.ecs-alb-role.arn
  tags = var.tags

  # These are the minimum values for Fargate containers.
  cpu = 256
  memory = 512
  requires_compatibilities = ["FARGATE"]
  network_mode = "awsvpc"

  container_definitions = <<EOF
[
  {
    "name": "${var.container_name}",
    "image": "${var.ecr_url}/${var.container_name}:latest",
    "portMappings": [
      {
        "containerPort": ${local.port}
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-region": "${var.aws_region}",
        "awslogs-group": "/ecs/${local.service_name}",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]
EOF
}

resource "aws_security_group" "http" {
  name = "http"
  description = "HTTP traffic"
  vpc_id = var.vpc_id
  tags = var.tags

  ingress {
    from_port = 80
    to_port = 80
    protocol = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "https" {
  name = "https"
  description = "HTTPS traffic"
  vpc_id = var.vpc_id
  tags = var.tags

  ingress {
    from_port = 443
    to_port = 443
    protocol = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "egress-all" {
  name = "egress_all"
  description = "Allow all outbound traffic"
  vpc_id = var.vpc_id
  tags = var.tags

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "api-ingress" {
  name = "api_ingress"
  description = "Allow ingress to API"
  vpc_id = var.vpc_id
  tags = var.tags

  ingress {
    from_port = 3000
    to_port = 3000
    protocol = "TCP"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
My GitHub Actions deploy config:
# This is based on:
# - https://docs.github.com/en/actions/guides/deploying-to-amazon-elastic-container-service
# - https://particule.io/en/blog/cicd-ecr-ecs/
env:
  AWS_REGION: eu-west-1
  ECR_REPOSITORY: my-service-api
  ECS_SERVICE: my-service-dev-api
  ECS_CLUSTER: my-service-dev-cluster
  TASK_DEFINITION: arn:aws:ecs:eu-west-1:123456789:task-definition/my-service-dev-api

name: Deploy

on:
  push:
    branches:
      - main

jobs:
  build:
    name: Deploy
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      packages: write
      contents: read
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@13d241b293754004c80624b5567555c4a39ffbe3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@aaf69d68aa3fb14c1d5a6be9ac61fe15b48453a2

      - name: Build, tag, and push image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          # Build a docker container and push it to ECR so that it can be deployed to ECS.
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER
          # Tag the image with the git SHA for debugging purposes
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          # We tag with ":latest" for debugging purposes, but don't use it for deployment
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$GITHUB_RUN_NUMBER"

      - name: Download task definition
        id: download-task
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ env.TASK_DEFINITION }} \
            --query taskDefinition > task-definition.json
          echo ${{ env.TASK_DEFINITION }}
          echo "::set-output name=revision::$(cat task-definition.json | jq .revision)"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: ${{ env.ECR_REPOSITORY }}
          image: ${{ steps.build-image.outputs.image }}

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
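          # As far as I understand, these two settings make the step wait for the ECS
          # service to become stable, and fail the job if it doesn't within the timeout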
          wait-for-service-stability: true
          wait-for-minutes: 5

      - name: De-register previous revision
        run: |
          aws ecs deregister-task-definition \
            --task-definition ${{ env.TASK_DEFINITION }}:${{ steps.download-task.outputs.revision }}
(I've anonymized some identifiers)
These configs deploy successfully; the only problem is that the GitHub CI doesn't fail, even though the ECS containers never pass the ALB health check.