
I have an ECS cluster, and occasionally one or two of its EC2 instances run out of disk space, which I then have to clean up manually or terminate.

I'd like to have a dashboard or graph to show the disk usage, preferably P90 to trace the worst cases.

Is there a built-in metric from AWS for this, or does it have to be a custom solution? I use Terraform, if that's relevant.

digit plumber
  • Possible duplicate of https://stackoverflow.com/q/59888358 – Chaitanya Mar 25 '23 at 17:28
  • Maybe. I do not use Fargate tho. – digit plumber Mar 25 '23 at 18:55
  • I believe the CloudWatch agent should be pre-installed on the EC2 instances. The CloudWatch agent on the instance can report disk volume usage to CloudWatch: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html Then it is just a matter of creating a metric in CloudWatch to alarm when usage is high. – Mark B Mar 26 '23 at 14:42
  • Under a cluster it is a bit awkward, e.g. the instance IDs are not readily available. Even with the instance IDs, another awkward example is that the metrics/alarm has to be cleaned up / deleted after an instance dies. – digit plumber Mar 27 '23 at 17:59

1 Answer


The CloudWatch agent can do this for you. Here's an example Terraform configuration that accomplishes it, derived from a very helpful blog article (linked in the comment at the top of main.tf).

main.tf

# Implementation based on https://jazz-twk.medium.com/cloudwatch-agent-on-ec2-with-terraform-8cf58e8736de

locals {
  role_policy_arns = [
    # AmazonSSMManagedInstanceCore replaces the deprecated AmazonEC2RoleforSSM policy.
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  ]
  userdata = templatefile("${path.module}/userdata.tpl", {
    ssm_cloudwatch_config = aws_ssm_parameter.cloudwatch_agent.name
  })
}

resource "aws_iam_role_policy_attachment" "cloudwatch_agent" {
  count = length(local.role_policy_arns)

  role       = var.instance_profile_role_name
  policy_arn = element(local.role_policy_arns, count.index)
}

resource "aws_iam_role_policy" "cloudwatch_agent" {
  name = "${var.name}-EC2-Inline-Policy"
  role = var.instance_profile_role_id
  policy = jsonencode(
    {
      "Version" : "2012-10-17",
      "Statement" : [
        {
          "Effect" : "Allow",
          "Action" : [
            "ssm:GetParameter"
          ],
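          # Limit read access to the agent config parameter instead of "*".
          "Resource" : aws_ssm_parameter.cloudwatch_agent.arn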
        }
      ]
    }
  )
}

resource "aws_ssm_parameter" "cloudwatch_agent" {
  description = "Cloudwatch agent config"
  name        = "/${var.service}/cloudwatch_agent_config.json"
  tags        = var.tags
  type        = "String"

  value = templatefile("${path.module}/config.tpl", {
    namespace = "${var.service}/CWAgent"
  })
}

variables.tf

variable "name" {
  type        = string
  description = "The name."
}

variable "tags" {
  type        = map(string)
  description = "The tags."
}

variable "service" {
  type        = string
  description = "The service name."
}

variable "instance_profile_role_id" {
  type        = string
  description = "The instance profile role ID."
}

variable "instance_profile_role_name" {
  type        = string
  description = "The instance profile role name."
}

userdata.tpl

# Configure Cloudwatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
rpm -U ./amazon-cloudwatch-agent.rpm

# Use cloudwatch config from SSM
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-c ssm:${ssm_cloudwatch_config} -s

config.tpl

{
    "agent": {
      "metrics_collection_interval": 60
    },
    "metrics": {
      "namespace": "${namespace}",
      "metrics_collected": {
          "mem": {
            "measurement": [
              "used_percent",
              "total"
            ]
          }
      },
      "append_dimensions": {
          "InstanceId": "$${aws:InstanceId}",
          "AutoScalingGroupName": "$${aws:AutoScalingGroupName}"
      },
      "aggregation_dimensions": [["InstanceId"], ["AutoScalingGroupName"]]
    }
}

Note: My configuration retrieves memory usage (something I was surprised to find isn't available by default on EC2 instances), but it's easy to modify for whatever metrics you are interested in, as described in the CloudWatch agent configuration file documentation (search for disk to see the disk metrics).
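For the disk case in the question, the same setup might look roughly like the sketch below. It is not part of my original configuration. It assumes the root filesystem ("/") is the one that fills up, and it inlines the agent config with jsonencode instead of config.tpl. The resource names cloudwatch_agent_disk and disk, and the variables var.asg_name and var.region, are hypothetical. Because the agent also rolls the metric up by AutoScalingGroupName, the dashboard does not need individual instance IDs, and there is nothing per-instance to clean up when an instance dies. Whether p90 is available can depend on how the data points are published, so treat that statistic as something to verify.

```hcl
# Sketch only: a disk-usage variant of the agent config plus a dashboard.
# var.asg_name and var.region are hypothetical variables, not defined above.

resource "aws_ssm_parameter" "cloudwatch_agent_disk" {
  description = "CloudWatch agent config (disk usage)"
  name        = "/${var.service}/cloudwatch_agent_disk_config.json"
  tags        = var.tags
  type        = "String"

  value = jsonencode({
    agent = {
      metrics_collection_interval = 60
    }
    metrics = {
      namespace = "${var.service}/CWAgent"
      metrics_collected = {
        disk = {
          measurement = ["used_percent"] # published as disk_used_percent
          resources   = ["/"]            # assumption: only the root filesystem matters
          drop_device = true             # omit the device dimension
        }
      }
      append_dimensions = {
        # "$${" is the HCL escape for a literal "${", which the agent expands itself.
        InstanceId           = "$${aws:InstanceId}"
        AutoScalingGroupName = "$${aws:AutoScalingGroupName}"
      }
      aggregation_dimensions = [["InstanceId"], ["AutoScalingGroupName"]]
    }
  })
}

resource "aws_cloudwatch_dashboard" "disk" {
  dashboard_name = "${var.name}-disk-usage"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Disk used % across the cluster (p90)"
          region = var.region
          stat   = "p90"
          period = 300
          metrics = [
            # Rolled-up metric: only the AutoScalingGroupName dimension remains.
            ["${var.service}/CWAgent", "disk_used_percent", "AutoScalingGroupName", var.asg_name]
          ]
        }
      }
    ]
  })
}
```

To use this variant, point ssm_cloudwatch_config in locals at aws_ssm_parameter.cloudwatch_agent_disk.name instead of the memory config.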

You will need to insert local.userdata into your EC2 instance launch user data for the CloudWatch agent to be properly installed/configured.
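As a rough sketch of that last step, assuming the instances come from a launch template declared in the same module (so local.userdata is in scope): the resource name ecs and the variables var.ami_id, var.instance_type, var.instance_profile_name and var.cluster_name are hypothetical and not part of the files above.

```hcl
# Sketch only: wiring local.userdata into an ECS container instance launch template.
# All var.* names below except var.name are hypothetical.

resource "aws_launch_template" "ecs" {
  name_prefix   = "${var.name}-"
  image_id      = var.ami_id
  instance_type = var.instance_type

  iam_instance_profile {
    name = var.instance_profile_name
  }

  # Launch templates expect base64-encoded user data.
  user_data = base64encode(join("\n", [
    "#!/bin/bash",
    # Register the instance with the ECS cluster.
    "echo ECS_CLUSTER=${var.cluster_name} >> /etc/ecs/ecs.config",
    # Install and start the CloudWatch agent (rendered from userdata.tpl).
    local.userdata,
  ]))
}
```

Building the script with join keeps the shebang on the first line and avoids heredoc indentation surprises.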