
Every time I submit a batch job, is a new Docker container created, or is the old container reused?

If a new Docker container is created every time, what happens to the container when the job is done?

In AWS ECS, the `ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION` variable sets how long to wait after a task has stopped before its Docker container is removed (3 hours by default).
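For context, on an ECS-optimized instance this is an agent setting in `/etc/ecs/ecs.config`; the snippet below is only an illustration of the 3-hour default described above, not my actual configuration:

```
# /etc/ecs/ecs.config (illustrative only)
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=3h
```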

If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?

I'm getting the error `CannotCreateContainerError: API error (500): devmapper` when running a batch job. Would it help if I cleaned up the Docker container files at the end of each job?

Harry Su

1 Answer


Every time I submit a batch job, is a new Docker container created, or is the old container reused?

Yes. Each job submitted to Batch runs as a new ECS task, meaning a new container for each job.

If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?

This all depends on your job workloads, job lengths, disk usage, etc. With large quantities of short jobs that consume disk, this is entirely possible.

`CannotCreateContainerError: API error (500): devmapper`

The documentation for this error indicates a few possible solutions; however, the first one, which you've already called out, may not help in this case.

`ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION`, which defaults to 3h on ECS, seems to be set to 2m by default on Batch clusters; you can inspect the EC2 user data on one of your Batch instances to validate that it is set this way on your clusters. Depending on the age of the cluster, these settings may differ. Batch does not automatically update to the latest ECS-optimized AMI without creating a whole new cluster, so I would not be surprised if it does not change these settings either.
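If you want to confirm what a given instance is actually using, a minimal check (assuming you can SSH into one of the Batch container instances) looks like this:

```
# Agent configuration written by the instance's launch user data
grep ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION /etc/ecs/ecs.config

# Or fetch the raw EC2 user data from the instance metadata service
curl -s http://169.254.169.254/latest/user-data
```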

If your cleanup duration is already set low, you might instead try creating a custom AMI that provisions a larger-than-normal Docker volume. By default, the ECS-optimized AMIs ship with an 8 GB root volume and a 22 GB volume for Docker.
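If you do build a custom AMI, Batch can use it in a managed compute environment via the `imageId` field of the compute resources. A rough AWS CLI sketch; every name, ID, and role below is a placeholder, not something taken from your setup:

```
# All identifiers below are placeholders
aws batch create-compute-environment \
  --compute-environment-name custom-ami-ce \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 16,
    "desiredvCpus": 0,
    "instanceTypes": ["optimal"],
    "imageId": "ami-0123456789abcdef0",
    "subnets": ["subnet-xxxxxxxx"],
    "securityGroupIds": ["sg-xxxxxxxx"],
    "instanceRole": "ecsInstanceRole"
  }' \
  --service-role arn:aws:iam::111111111111:role/AWSBatchServiceRole
```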

Luke Waite
  • Thanks. `ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION` is indeed set to 2m by Batch. Not sure what caused the unused containers to accumulate, but after recreating the EC2 instance within ECS, containers are cleaned up properly – Harry Su Sep 11 '18 at 06:43
  • The root cause turns out to be that the ECS container agent lives in a Docker container, and all the logs it generates are filling up the space. I configured the ECS task definition to send Docker logs to CloudWatch, but the ECS container agent itself is not configured by a task definition. Any idea how to change the log driver for the ECS container agent? – Harry Su Sep 13 '18 at 00:35
  • That is intriguing. It looks like you can change the container agent to log to CloudWatch Logs if you check out this guide: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html#configure_cwl_agent There are two ways to implement what the above requires, to my knowledge: a custom AMI, or EC2 user data if you use an unmanaged compute environment. – Luke Waite Sep 13 '18 at 00:42
  • Looks like I need to provide {cluster} and {container_instance_id} to set up the CloudWatch Logs agent, which is not possible in a custom AMI. As for user data, I am using Batch with a managed compute environment, so that's also not doable? – Harry Su Sep 13 '18 at 06:00
  • Also, it's not just the container agent logs (stored in /var/log/ecs/ecs-agent.log) that are taking up space; the container agent, as a Docker container, also creates a huge log file, in this case stored in /var/lib/docker/containers//-json.log (see the log-rotation sketch after these comments). – Harry Su Sep 13 '18 at 06:09
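Not an authoritative fix, but one common way to keep those json.log files from growing without bound is to cap Docker's default json-file log driver at the daemon level, baked into a custom AMI or applied via user data as discussed above. A minimal sketch, assuming the daemon reads `/etc/docker/daemon.json` and that the agent container gets re-created afterwards so it picks up the new defaults (existing containers keep their old log settings):

```
# Cap json-file logs for containers created after this change
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo service docker restart   # the ECS agent container may need to be restarted/re-created as well
```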