Occasionally, when I run TensorFlow on a single GPU in a multi-GPU machine, the code executes on one GPU while allocating its memory on another. For obvious reasons, this causes a major slowdown.
As an example, see the output of nvidia-smi below. Here, a colleague of mine is using GPUs 0 and 1 (processes 32918 and 33112), and I start TensorFlow with the following commands (before import tensorflow):
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
where gpu_id is 2, 3 and 4 respectively for my three processes. As you can see, the memory is correctly allocated on GPUs 2, 3 and 4, but the code is executed somewhere else! In this case, on GPUs 0, 1 and 7.
Wed May 17 17:04:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 41C P0 75W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 36C P0 89W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:08:00.0 Off | 0 |
| N/A 61C P0 58W / 149W | 6265MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 42C P0 70W / 149W | 8313MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 51C P0 55W / 149W | 8311MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 29C P0 68W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 31C P0 54W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 27C P0 68W / 149W | 0MiB / 11439MiB | 33% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 32918 C python 274MiB |
| 1 33112 C python 274MiB |
| 2 34891 C ...sadl/anaconda3/envs/tensorflow/bin/python 6259MiB |
| 3 34989 C ...sadl/anaconda3/envs/tensorflow/bin/python 8309MiB |
| 4 35075 C ...sadl/anaconda3/envs/tensorflow/bin/python 8307MiB |
+-----------------------------------------------------------------------------+
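For completeness, here is a minimal, self-contained sketch of how each of my processes is set up. The hard-coded gpu_id and the log_device_placement flag are only for illustration (log_device_placement is just a way to print which device each op actually gets assigned to); my real scripts only set the two environment variables shown above before importing TensorFlow.

import os

# These must be set before the first import of TensorFlow.
gpu_id = 2  # illustration only; my three processes use 2, 3 and 4
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

import tensorflow as tf

# log_device_placement prints the device each op is placed on, which
# should make it possible to see where the kernels actually run.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    print(sess.run(a + b))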
It seems that TensorFlow, for some reason, is partially ignoring the CUDA_VISIBLE_DEVICES setting.
I am not using any explicit device placement commands anywhere in the code (nothing like the illustration below).
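To be clear about what I mean by "device placement", explicit placement would look something like this sketch; there is nothing of this kind in my code:

import tensorflow as tf

# Illustration only -- my code contains no explicit placement like this.
with tf.device("/gpu:0"):
    x = tf.constant([1.0, 2.0], name="x")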
I experienced this with TensorFlow 1.1 running on Ubuntu 16.04, and it has happened to me across a range of different scenarios.
Is there some known scenario in which this could happen? If so, is there anything I can do about it?