
I'm teaching myself what Docker is and how to use it. I'm really new to Docker, so I hope to learn some basics here.

I installed nvidia-docker (following the installation guide) and pulled tensorflow/tensorflow:nightly-gpu-py3 (the nightly GPU (CUDA) image with Python 3) on my computer.

  • Docker: NVIDIA Docker 2.0.3, Version: 17.12.1-ce
  • Host OS: Ubuntu 16.04 Desktop
  • Host Arch: amd64

My Problem

Both cifar10_multi_gpu_train (written in Python with TensorFlow) and a simple Monte Carlo simulation (written in pure CUDA) fail to run (fatal error: no curand.h), while an FDM code (written in pure CUDA) and a simple matrix multiplication (written in Python with TensorFlow) work in the container (tensorflow/tensorflow:nightly-gpu-py3).

Codes that only use CPUs (like A3C) work fine with TensorFlow.

Some codes that use GPUs return an error message (whenever the code includes <curand.h>).

Details

In the container (tensorflow/tensorflow:nightly-gpu-py3), when I run the Monte Carlo simulation, I get the following error:

fatal error: curand.h: No such file or directory


`locate curand.h` returns nothing, but `locate curand` gives:

/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0.176
/usr/share/doc/cuda-curand-9-0
/usr/share/doc/cuda-curand-9-0/changelog.Debian.gz
/usr/share/doc/cuda-curand-9-0/copyright
/var/lib/dpkg/info/cuda-curand-9-0.list
/var/lib/dpkg/info/cuda-curand-9-0.md5sums
/var/lib/dpkg/info/cuda-curand-9-0.postinst
/var/lib/dpkg/info/cuda-curand-9-0.postrm
/var/lib/dpkg/info/cuda-curand-9-0.shlibs
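So the cuRAND runtime library is present but its development header apparently is not. Here is a quick check, plus one possible fix that assumes the container can reach NVIDIA's apt repository and that the header ships in a `-dev` package following the usual CUDA 9.0 naming (the package name is my assumption, inferred from the cuda-curand-9-0 package already in the image):

```shell
# Inside the container: the runtime library is there...
ls /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so*
# ...but the development header is not:
ls /usr/local/cuda/include/curand.h   # No such file or directory

# Possible fix (package name assumed from the CUDA 9.0 apt naming):
apt-get update && apt-get install -y cuda-curand-dev-9-0
```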

and `locate cudnn.h` gives:

/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/util/use_cudnn.h

and `locate cuda.h` gives:

/usr/include/linux/cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/dynlink_cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/dynlink_cuda_cuda.h
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/platform/cuda.h
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/platform/stream_executor_no_cuda.h


`nvcc --version` returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176


On the host (outside the container), when I run `nvidia-docker run nvidia/cuda nvidia-smi`, I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   48C    P8    22W / 250W |    301MiB / 11177MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
|  0%   51C    P8    22W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

What I've Done

  1. Reinstalled nvidia-docker and nightly-gpu-py3, then tried #include <curand.h> again --> failed

  2. Inside the nightly-gpu-py3 container, reinstalled CUDA / the CUDA Toolkit and tried #include <curand.h> again --> failed

  3. Ran all the codes on another machine that does not use Docker and already has CUDA and tensorflow-gpu installed. They work fine.

I guess I totally misunderstand the concept of nvidia-docker and what images/containers do.
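Part of my confusion may be here: since I start containers with --rm, anything I install inside them is discarded when the container exits. If I understand correctly, a persistent fix would have to be baked into a new image, something like this (the image tag and the package name are my assumptions):

```shell
# Dockerfile: extend the TensorFlow image with the cuRAND headers
# (package name assumed from the CUDA 9.0 apt naming convention).
#
#   FROM tensorflow/tensorflow:nightly-gpu-py3
#   RUN apt-get update && apt-get install -y cuda-curand-dev-9-0

docker build -t my-tf-gpu-curand .              # bake the change into an image
nvidia-docker run -it --rm my-tf-gpu-curand /bin/bash
```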

Question

  1. After installing nvidia-docker, I can run a container using `nvidia-docker run <myImage>`. Doesn't a Docker image save the dependencies (PATH, packages, ...) needed to run a certain piece of code (in my case, code that uses <curand.h>), while the container does the actual work?
  2. Does the tensorflow/tensorflow:nightly-gpu-py3 image include the CUDA Toolkit and cuDNN? Does the absence of <curand.h> in nightly-gpu-py3 mean I installed/downloaded nvidia-docker or the image improperly?
  3. Installing the CUDA Toolkit or reinstalling CUDA inside the container (nightly-gpu-py3) has failed (I followed the process here). Is there any way to use <curand.h> inside the container (nightly-gpu-py3)?
  4. `sudo nvidia-docker run -it --rm -p 8888:8888 -p 6006:6006 <image> /bin/bash` is the command I use to start a new container from a given image. Could that be a problem?
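For reference, if the missing header comes from using a runtime-only image, the same command pointed at a devel-tagged image (which, per Docker Hub, includes the full build dependencies) might look like this; the source mount and the nvcc line are just a sketch of how I would compile the Monte Carlo code, with a hypothetical file name:

```shell
sudo nvidia-docker run -it --rm \
    -p 8888:8888 -p 6006:6006 \
    -v "$PWD":/workspace -w /workspace \
    tensorflow/tensorflow:nightly-devel-gpu-py3 /bin/bash

# Inside the container, link against cuRAND explicitly:
nvcc monte_carlo.cu -o monte_carlo -lcurand
```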

    From [here](https://hub.docker.com/r/tensorflow/tensorflow/) "**Devel** docker images include all the necessary dependencies to build from source whereas the other binaries simply have TensorFlow installed." You're using a non-devel image. Try using a devel image if you want the full CUDA toolkit available. in your case that would be `nightly-devel-gpu-py3` instead of `nightly-gpu-py3` – Robert Crovella Apr 19 '18 at 14:18
  • I did this: `nvidia-docker run -it --rm -p 8888:8888 tensorflow/tensorflow:nightly-devel-gpu-py3 /bin/bash` and then looked at `/usr/local/cuda/include` in that container, and it did indeed have `curand.h` and appears to be a full CUDA toolkit install. – Robert Crovella Apr 19 '18 at 14:44

0 Answers