1

Is there any way to use nvidia-docker with Nomad?

The program that does the computation on the Nvidia GPU works locally, but it doesn't work with nvidia-docker (it falls back to the CPU instead of the GPU).

What is the preferred way to do that?

  • Use nvidia-docker driver for Nomad
  • Use raw docker exec to run nvidia-docker
  • Somehow connect Nomad to nvidia-docker engine

Does anyone have experience with that?

Jonathan Leffler
Kamil Lelonek

3 Answers

1

This is something I've spent a lot of time implementing. At the time of writing (Nomad 0.7.0 is out, though I'm running 0.5.6 myself), there is no "Nomadic" way to implement an nvidia-docker job without using raw fork/exec, which doesn't provide container orchestration, service discovery, log shipping, or resource management (i.e. bin packing).

I was surprised to find that the nvidia-docker command doesn't do the GPU work itself; rather, it forwards commands to docker. The only time it really does anything extra is on the run/exec commands (i.e. nvidia-docker run --yar blar), where it calls a helper program that returns a JSON response with the appropriate device and volume mounts. When the container data is sent to the actual Docker socket, it includes the correct devices and the driver volume for the version of CUDA installed on the host (inspect your container to see this).
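To see what it injected, you can inspect a container started through nvidia-docker. This is only a quick sketch; the container name gpu-test and the nvidia/cuda image are placeholders for whatever you are actually running:

    # Start a throwaway container through nvidia-docker (image/name are examples)
    nvidia-docker run -d --name gpu-test nvidia/cuda sleep 3600

    # Devices that were passed through (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, ...)
    docker inspect --format '{{json .HostConfig.Devices}}' gpu-test

    # Mounts, including the driver volume matching the host's driver version
    docker inspect --format '{{json .Mounts}}' gpu-test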

The other part of implementing this solution with the exec driver is to create a task that handles the deployment itself, if you want rolling deploys. I am using a simple script to orchestrate a rolling deploy inside the same task group as the nvidia-docker task; a rough sketch follows. As long as you use stagger and max_parallel (set to 1) for your task group, and make sure the orchestration task has a dynamic argument such as a random value or a date (Nomad will not update the task if there are zero differences), you should be set.
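Something along these lines (a sketch only; the container name, image, and the nonce argument are hypothetical and would come from your own job spec):

    #!/usr/bin/env bash
    # Hypothetical rolling-deploy helper run as an exec/raw_exec task in the
    # same task group as the nvidia-docker workload.
    set -euo pipefail

    NAME="gpu-worker"               # placeholder container name
    IMAGE="myorg/gpu-app:latest"    # placeholder image
    NONCE="${1:-}"                  # dynamic arg (date/random) so Nomad sees a diff

    # Pull first to keep the restart window short
    docker pull "$IMAGE"

    # Replace the previous container, if any
    docker rm -f "$NAME" 2>/dev/null || true

    # nvidia-docker forwards to docker, adding the GPU devices and driver volume
    nvidia-docker run -d --name "$NAME" "$IMAGE"

    echo "deployed $IMAGE (nonce: $NONCE)"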

Once Nomad has a GPU resource type (this needs a custom fingerprint: https://github.com/hashicorp/nomad/tree/master/client/fingerprint) and the ability to mount non-block devices (i.e. something that is not a disk), it should be possible to do away with nvidia-docker entirely. I hope this helps; be sure to bump the feature request here:

https://github.com/hashicorp/nomad/issues/2938

To expand on running this with conventional docker: you must also mount the volume created by nvidia-docker. docker volume ls will show the named volumes; you must mount the CUDA driver volume for your container to have access to the drivers (unless you are already baking them into your container, which is not recommended).
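A sketch of what that looks like with plain docker, assuming nvidia-docker 1.x created a driver volume named nvidia_driver_375.39 (the name follows the host driver version, so check docker volume ls for yours):

    # List named volumes; nvidia-docker creates one per installed driver version
    docker volume ls

    # Run with plain docker, passing the devices and the driver volume yourself
    docker run -d \
      --device /dev/nvidiactl \
      --device /dev/nvidia-uvm \
      --device /dev/nvidia0 \
      -v nvidia_driver_375.39:/usr/local/nvidia:ro \
      nvidia/cuda nvidia-smi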

0

This has native support as of Nomad 0.9: https://www.hashicorp.com/blog/using-hashicorp-nomad-to-schedule-gpu-workloads

Chris Baker
-2

The idea was to create a proper Docker image for that:

FROM debian:wheezy

# Run apt/debconf in non-interactive mode
ENV DEBIAN_FRONTEND noninteractive

# Provide environment variables for the NVIDIA/CUDA driver that match the version installed on the host machine
ENV CUDA_DRIVER  375.39
ENV CUDA_INSTALL http://us.download.nvidia.com/XFree86/Linux-x86_64/${CUDA_DRIVER}/NVIDIA-Linux-x86_64-${CUDA_DRIVER}.run

# Configure dependencies
RUN \
# Update available packages
  apt-get update \
            --quiet \
# Install all requirements
  && apt-get install \
            --yes \
            --no-install-recommends \
            --no-install-suggests \
       build-essential \
       module-init-tools \
       wget \
# Clean up leftovers
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

# Install CUDA drivers
RUN wget \
      $CUDA_INSTALL \
        -P /tmp \
        --no-verbose \
      && chmod +x /tmp/NVIDIA-Linux-x86_64-${CUDA_DRIVER}.run \
      && /tmp/NVIDIA-Linux-x86_64-${CUDA_DRIVER}.run \
        -s \
        -N \
        --no-kernel-module \
      && rm -rf /tmp/*

ENTRYPOINT ["/bin/bash"]

and then:

  1. build the base Docker image:

    docker build . -t cuda
    
  2. start a container from the cuda base image (a GPU-visibility check follows below):

    docker run \
      --device=/dev/nvidia0:/dev/nvidia0 \
      --device=/dev/nvidiactl:/dev/nvidiactl \
      --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
      -it \
      --rm cuda
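
To check that the container actually sees the GPU, you can run nvidia-smi through the bash entrypoint (assuming the driver installer above put the userspace tools, including nvidia-smi, into the image):

    docker run \
      --device=/dev/nvidia0:/dev/nvidia0 \
      --device=/dev/nvidiactl:/dev/nvidiactl \
      --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
      --rm cuda \
      -c "nvidia-smi"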
    

A message like:

Failed to initialize NVML: Unknown Error

could be due to a mismatch between the host and container driver versions, or to missing /dev entries on the host.
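Two quick things to check on the host in that case (the device paths are the same ones passed via --device above):

    # Host driver version; it must match CUDA_DRIVER baked into the image
    cat /proc/driver/nvidia/version

    # The device nodes must exist on the host before they can be passed through
    ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm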

Kamil Lelonek
  • You're installing the nvidia driver inside the docker image/dockerfile. That voids the intent of `nvidia-docker`. The point of `nvidia-docker` is so that there will be no mismatch between the host and container driver versions. – Robert Crovella Sep 02 '17 at 11:25