I've been struggling with a weird problem for some days. I'm implementing the ECS logic to drain instances on termination (specifically on a Spot interruption notice) using the ECS_ENABLE_SPOT_INSTANCE_DRAINING=true env var on the ecs-agent.

The process works fine: when an interruption notice arrives, ECS drains the instance and moves the containers to another one. The problem is that if the target instance has never started that image before, the task takes too long to start (about 3 minutes, while the Spot interruption notice gives only 2 minutes), causing availability issues. If the image has already been started on that instance, it takes only 20 seconds to spin up the task!

Have you experienced this problem before using ECS?

PS: The images are about 500 MB. Is that large for an image?

DGomez
  • Is this a Fargate or an EC2-backed ECS cluster? If EC2, then try to connect to the machine via ssh/ssm and attempt to `docker pull` the image. Also review the ECS logs on the machine for any errors. – Shoan Aug 16 '22 at 12:14
  • Reducing the image size improves the load time, but it is still weird that the first load of the image on an instance takes so much longer. – DGomez Aug 17 '22 at 20:40

1 Answer

There are some strategies available to you:

  1. Reduce the size of the image by optimising the Dockerfile. A smaller image is quicker to pull from the repository.
  2. Bake the large image into the AMI used by the cluster, so every new Spot instance starts with the image already present. Depending on how the Dockerfile is structured, a significant number of layers can be reused even when the image changes, resulting in quicker image pulls.

Once the image has been pulled to a machine it is cached, and subsequent pulls are almost instantaneous.
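To check how much of the slow first start is really image pull time, you could time a cold and a warm `docker pull` on a fresh instance. The following is only a minimal sketch: the helper names `timed_pull` and `fits_in_notice` are made up for illustration, the image name is whatever your task definition uses, and the pull is only one part of the total task startup time.

```python
import subprocess
import time


def timed_pull(image, run=subprocess.run, clock=time.monotonic):
    """Run `docker pull IMAGE` and return the elapsed time in seconds.

    `run` and `clock` are injectable only so the helper can be exercised
    without a Docker daemon; by default the real CLI is invoked.
    """
    start = clock()
    # check=True raises CalledProcessError if the pull fails.
    run(["docker", "pull", image], check=True)
    return clock() - start


def fits_in_notice(startup_seconds, notice_seconds=120):
    """Return True if startup finishes within the Spot interruption notice.

    The 120 second default is the two-minute notice mentioned in the question.
    """
    return startup_seconds < notice_seconds
```

On a new instance, calling `timed_pull` twice with the same image should show the gap between a cold pull (all layers downloaded) and a warm one (layers already cached). With the numbers from the question, `fits_in_notice(180)` is false while `fits_in_notice(20)` is true, which is why pre-caching the image matters here.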

Shoan