I'm teaching myself what Docker is and how to use it. I'm really new to Docker, so I hope to learn some basics here.
I installed nvidia-docker (following the installation guide) and pulled tensorflow/tensorflow:nightly-gpu-py3 (the nightly GPU (CUDA) image) on my computer.
- Docker: NVIDIA Docker 2.0.3, Version: 17.12.1-ce
- Host OS: Ubuntu 16.04 Desktop
- Host Arch: amd64
My Problem
Both cifar10_multi_gpu_train (written in Python with TensorFlow) and a simple Monte Carlo simulation (written in pure CUDA) fail to run (fatal error: no curand.h), while an FDM code (written in pure CUDA) and a simple matrix multiplication (written in Python with TensorFlow) work in the container (tensorflow/tensorflow:nightly-gpu-py3).
Code that only uses CPUs (like A3C) works fine with TensorFlow. Some code that uses GPUs returns an error message (when the code uses <curand.h>).
Details
In the container (tensorflow/tensorflow:nightly-gpu-py3), when I run the Monte Carlo simulation, I get the following error:
fatal error: curand.h: No such file or directory
locate curand.h returns nothing, but when I try locate curand, I get:
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0.176
/usr/share/doc/cuda-curand-9-0
/usr/share/doc/cuda-curand-9-0/changelog.Debian.gz
/usr/share/doc/cuda-curand-9-0/copyright
/var/lib/dpkg/info/cuda-curand-9-0.list
/var/lib/dpkg/info/cuda-curand-9-0.md5sums
/var/lib/dpkg/info/cuda-curand-9-0.postinst
/var/lib/dpkg/info/cuda-curand-9-0.postrm
/var/lib/dpkg/info/cuda-curand-9-0.shlibs
and for locate cudnn.h:
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/util/use_cudnn.h
For locate cuda.h:
/usr/include/linux/cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/dynlink_cuda.h
/usr/local/cuda-9.0/targets/x86_64-linux/include/dynlink_cuda_cuda.h
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/platform/cuda.h
/usr/local/lib/python3.5/dist-packages/tensorflow/include/tensorflow/core/platform/stream_executor_no_cuda.h
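The locate output above only ever shows the cuRAND runtime library (libcurand.so), never a curand.h header. Since the locate database inside a container can also be stale, a direct filesystem check is more reliable; a minimal sketch, with the search paths assumed from the CUDA 9.0 layout shown above:

```shell
# Look for the cuRAND development header in the usual CUDA 9.0 locations.
# (Paths assumed from the locate output above; adjust for other versions.)
found=no
for dir in /usr/local/cuda/include \
           /usr/local/cuda-9.0/targets/x86_64-linux/include; do
    if [ -f "$dir/curand.h" ]; then
        found=yes
        echo "curand.h found in $dir"
    fi
done
[ "$found" = yes ] || echo "curand.h missing"
```

If this prints "curand.h missing", the image ships only the runtime library and not the development header that #include <curand.h> needs at compile time.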
nvcc --version returns:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
In the host (outside the container), when I run nvidia-docker run nvidia/cuda nvidia-smi, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:03:00.0 On | N/A |
| 0% 48C P8 22W / 250W | 301MiB / 11177MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:81:00.0 Off | N/A |
| 0% 51C P8 22W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
What I've Done
- Reinstalled nvidia-docker and nightly-gpu-py3, then tried #include <curand.h> again --> failed.
- Inside the nightly-gpu-py3 container, reinstalled CUDA / the CUDA Toolkit, then tried #include <curand.h> again --> failed.
- Tried to run all the codes on another machine that does not use Docker and already has CUDA/tensorflow-gpu installed. They work fine.
I guess I totally misunderstand the concept of nvidia-docker and what images/containers do.
Question
- After I installed nvidia-docker, I can run a container using nvidia-docker run <myImage>. Doesn't a Docker image mean it saves the dependencies (PATH, packages, ...) needed to run certain code (in my case, code that uses <curand.h>), while the container does the actual work?
- Does the tensorflow/tensorflow:nightly-gpu-py3 image include the CUDA Toolkit/cuDNN? Does the absence of <curand.h> in nightly-gpu-py3 mean I installed/downloaded nvidia-docker or nightly-gpu-py3 improperly?
- Installing the CUDA Toolkit, or reinstalling CUDA inside the container (nightly-gpu-py3), has failed (I followed the process here). Is there any way I can use <curand.h> inside the container (nightly-gpu-py3)?
- sudo nvidia-docker run -it --rm -p 8888:8888 -p 6006:6006 <image> /bin/bash is my command to start a new container with a given image. Could that be the problem?