I'm running Jenkins slaves in an AWS ECS cluster, configured as described in this post: Jenkins in ECS.

Normally it works well, but sometimes during peak hours the slave container starts very slowly (more than 40 minutes), or cannot start at all.

I then have to terminate the ECS instance and launch a new one. When the container cannot start, I see this log from the ecs-agent:

STOPPED, Reason CannotCreateContainerError: API error (500): devmapper: Thin Pool has 788 free data blocks which is less than minimum required 4454 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior

Here is my docker info output; please advise me on how to fix this issue.

[root@ip-10-124-2-159 ec2-user]# docker info
Containers: 10
 Running: 1
 Paused: 0
 Stopped: 9
Images: 2
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file:
 Metadata file:
 Data Space Used: 8.646 GB
 Data Space Total: 23.35 GB
 Data Space Available: 14.71 GB
 Metadata Space Used: 2.351 MB
 Metadata Space Total: 25.17 MB
 Metadata Space Available: 22.81 MB
 Thin Pool Minimum Free Space: 2.335 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.39-34.54.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: ip-10-124-2-159
ID: 6HVT:TWH3:YP6T:GMZO:23TM:EUAA:F7XJ:ISII:QDE7:V2SN:XKFI:XPGZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
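
If I read those numbers correctly (my own back-of-the-envelope math, so please correct me if I'm wrong), the error lines up with the pool block size shown above:

4454 blocks x 524.3 kB/block ≈ 2.335 GB (the "Thin Pool Minimum Free Space" reported by docker info)
788 blocks x 524.3 kB/block ≈ 0.41 GB (what the pool actually had free when the task failed to start)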

I also don't understand why only 4 tasks can run at the same time even though the ECS instance still has resources available. How can I increase that?

3 Answers

Your problem is a very common one when you start and stop containers very often, and the post you just mentioned is all about that! They specifically say that:

"The Amazon EC2 Container Service Plugin can launch containers on your ECS cluster that automatically register themselves as Jenkins slaves, execute the appropriate Jenkins job on the container, and then automatically remove the container/build slave afterwards"

The problem with this is that, if the stopped containers are not cleaned up, you eventually run out of space in Docker's storage pool, as you have experienced (that is exactly what the devmapper error above is telling you). You can check this yourself if you ssh into the instance and run the following command:

docker ps -a

If you run this command when Jenkins is getting in trouble, you should see an almost endless list of stopped containers. You can delete them all by running the following command:

docker rm -f $(docker ps -aq -f status=exited)

However, doing this manually every so often is really not very convenient, so what you really want to do is include the following lines in the userData parameter of your ECS instance configuration when you launch it:

echo "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m" >> /etc/ecs/ecs.config
echo "ECS_CLUSTER=<NAME_OF_CLUSTER>" >> /etc/ecs/ecs.config
echo "ECS_DISABLE_IMAGE_CLEANUP=false" >> /etc/ecs/ecs.config
echo "ECS_IMAGE_CLEANUP_INTERVAL=10m" >> /etc/ecs/ecs.config
echo "ECS_IMAGE_MINIMUM_CLEANUP_AGE=30m" >> /etc/ecs/ecs.config

This will instruct the ECS agent to enable a cleanup daemon that checks every 10 minutes (that is the lowest interval you can set) for images to delete, deletes containers 1 minute after the task has stopped, and deletes images which are 30 minutes old and no longer referenced by an active Task Definition. You can learn more about these variables here.
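
One detail that is easy to miss: for those lines to actually run at boot they need to be part of a real shell script in userData, i.e. the first line of the userData should be #!/bin/bash. Once the instance is up you can sanity-check that the agent picked everything up, for example like this (the introspection endpoint below is part of the stock ECS agent):

# on the instance, after boot
cat /etc/ecs/ecs.config
curl -s http://localhost:51678/v1/metadata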

In my experience, this configuration might not be enough if you start and stop containers very fast, so you may want to attach a decent volume to your instance in order to make sure you have enough space to carry on while the daemon cleans up the stopped containers.
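
For example, if you use the standard ECS-optimized AMI, Docker's devicemapper pool lives (as far as I know) on a dedicated 22 GiB volume attached as /dev/xvdcz, which matches the ~23 GB "Data Space Total" in your docker info. You could request a bigger volume when launching the instance, along these lines (the device name, size and placeholders are assumptions you should adapt to your setup):

aws ec2 run-instances \
    --image-id <ECS_OPTIMIZED_AMI_ID> \
    --instance-type <INSTANCE_TYPE> \
    --user-data file://userdata.sh \
    --block-device-mappings '[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":100,"VolumeType":"gp2"}}]'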

  • That's perfect, I will add ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m to /etc/ecs/ecs.config via the user data. Since my ECS instance only uses one image, there is no need to auto-remove images. – Tien Dung Tran Jun 15 '17 at 08:44
  • "In my experience, this configuration might not be enough if you start and stop containers very fast, so you may want to attach a decent volume to your instance in order to make sure you have enough space" => do you mean that if I increase the storage of the ECS instance, it can run more than 4 task definitions at the same time? – Tien Dung Tran Jun 15 '17 at 08:45
  • Sort of but not exactly :). In order to run more task definitions at the same time, you generally need more CPU and more memory. However, containers also consume disk space, including stopped containers. So if the cleanup mechanism is not fast enough, at some point you run out of space to define new containers. I hope this helps! – Jose Haro Peralta Jun 15 '17 at 09:05

Thanks Jose for the answer.

For the record, this is the exact command that worked for me on Docker 1.12.*:

docker rm $(docker ps -aqf "status=exited")

The 'q' flag limits the output to just the container IDs, which is what docker rm needs in order to remove them.
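
As a side note, if you are on Docker 1.13 or later (rather than the 1.12 in the question), I believe the built-in prune command does the same cleanup in one go:

docker container prune -f

The -f flag just skips the confirmation prompt, which is what you want when running it from a script or cron job.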

Balaji Radhakrishnan

If you upgrade to the latest ECS agent (or the latest ECS-optimized AMI, amzn-ami-2017.09.d-amazon-ecs-optimized or later), you can configure ECS automated cleanup of defunct images, containers and volumes in the ECS config of the EC2 hosts serving the cluster.

This cleans up after a node(label){} clause, but not the Docker work done inside that build:

  • node container and its volumes - cleaned
  • docker images generated by steps executed upon that node - not cleaned

ECS is blind to what happens on that node. Given that the node containers themselves should be the largest items, the ECS automated cleanup should reduce the need for a separate cleaning task to a minimum.
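
For the images your builds generate you therefore still have to clean up yourself. One way to do that (my own workaround, not an ECS feature, and it assumes the node runs Docker 1.13 or later) is to run a prune as the last shell step of the Jenkins job:

# final build step on the Jenkins node: drop dangling images left behind by the build
docker image prune -f

Adding -a would also remove unused tagged images, which frees more space but throws away the layer cache, so subsequent builds get slower.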

Peter Kahn