Launching jobs with large docker images in mesos via aurora can be slow

Question

When launching a task over mesos via aurora that uses a rather large docker image (~2GB) there is a long wait time before the task actually starts.

Even when the task has been launched before, and we would expect the docker image to already be available on the worker node, there is still a wait that depends on image size before the task actually launches. With docker, you can launch a container almost instantly as long as the image is already in your images list. Does the mesos containerizer not support this "caching" as well? Is this functionality something that can be configured?

I haven't tried using the docker containerizer, but it is my understanding that it will be phased out soon anyway and that gpu resource isolation, which we require, only works for the mesos containerizer.

How long a wait? Even if the image was already built and/or downloaded, if you are creating a new container from an image, it will have to at least read in the image file (which may be composed of many layers), create a container from it, etc, and that alone could take time. It is hard to say since you didn't quantify what "long delay" means and we don't know whether you are creating new containers, re-launching existing containers... — Dan Lowe, Feb 24 '17 at 14:47
In this case, by long wait I mean about 1 minute for a ~2GB image. I know that when executing this same image with nvidia-docker run, it starts up within only a couple of seconds, whereas when launching it as part of an Aurora job it takes about 1 minute until the specified processes start. Also, the wait time when launching a job with a much smaller image (just a few hundred MB) is noticeably shorter, at maybe ~10 seconds. Now, I'm not an expert on how Docker creates containers from images, but I would have expected similar behavior from the mesos containerizer. — andrei, Feb 24 '17 at 14:54
I'd also like to add that I know for certain that this delay isn't due to re-importing the image or anything like that: I tried a scenario where I execute an aurora job that uses a docker image, update the image in the registry, and then execute the same job again. In this case, the initial image was used and the updated one was ignored. — andrei, Feb 24 '17 at 14:57

anaken78 · Accepted Answer · 2017-03-14T05:12:59.830

I am assuming you are talking about the unified containerizer running docker images? Which backend are you using? By default the Mesos agents use the copy backend, which is why you are seeing slow launches. You can check which backend the agent is using by hitting the flags endpoint on the agent. Switch the backend to aufs or overlayfs (flag value overlay) to see if it speeds up the launch. You can specify the backend through the flag --image_provisioner_backend=VALUE on the agent.

NOTE: There are a few bug fixes related to the aufs and overlayfs backends in the latest Mesos release, 1.2.0-rc1, that you might want to pick up. There is also an auto-backend feature in 1.2.0-rc1 that automatically selects the fastest backend available.
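The flags check described in this answer could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the agent's flags endpoint returns JSON shaped like {"flags": {...}}, and the helper names provisioner_backend and fetch_agent_flags are made up for illustration. The agent URL is something you supply yourself (agents commonly listen on port 5051, but check your own setup).

```python
import json
from typing import Optional
from urllib.request import urlopen


def provisioner_backend(flags_payload: dict) -> Optional[str]:
    """Return the image_provisioner_backend value from a parsed /flags
    response, or None if the flag does not appear in it.

    Assumes the payload looks like {"flags": {"name": "value", ...}}.
    """
    return flags_payload.get("flags", {}).get("image_provisioner_backend")


def fetch_agent_flags(agent_url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse the flags endpoint of a Mesos agent.

    agent_url is a placeholder you provide, e.g. "http://my-agent:5051".
    """
    with urlopen(f"{agent_url.rstrip('/')}/flags", timeout=timeout) as response:
        return json.load(response)
```

Calling provisioner_backend(fetch_agent_flags("http://my-agent:5051")) against a real agent should return the configured backend name, such as copy. A result of None means the flag was not in the response, which may be the case on agents that choose the backend automatically.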

This definitely sounds like it's the problem. I'll play around a bit switching the provisioner backend and look up the bugs you mentioned. — andrei, Feb 27 '17 at 06:52
