Occasionally, when I run TensorFlow on a single GPU in a multi-GPU machine, the code executes on one GPU while allocating its memory on another. For obvious reasons, this causes a major slowdown.
As an example, see the output of nvidia-smi below. Here, a colleague of mine is using GPUs 0 and 1 (processes 32918 and 33112), and I start TensorFlow with the following commands (before import tensorflow):
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
where gpu_id is 2, 3 and 4 respectively for my three processes. As you can see, the memory is correctly allocated on GPUs 2, 3 and 4, but the code is executed somewhere else! In this case, on GPUs 0, 1 and 7.
Wed May 17 17:04:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 41C P0 75W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 36C P0 89W / 149W | 278MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:08:00.0 Off | 0 |
| N/A 61C P0 58W / 149W | 6265MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 42C P0 70W / 149W | 8313MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 51C P0 55W / 149W | 8311MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 29C P0 68W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 31C P0 54W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 27C P0 68W / 149W | 0MiB / 11439MiB | 33% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 32918 C python 274MiB |
| 1 33112 C python 274MiB |
| 2 34891 C ...sadl/anaconda3/envs/tensorflow/bin/python 6259MiB |
| 3 34989 C ...sadl/anaconda3/envs/tensorflow/bin/python 8309MiB |
| 4 35075 C ...sadl/anaconda3/envs/tensorflow/bin/python 8307MiB |
+-----------------------------------------------------------------------------+
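For completeness, here is a minimal, self-contained sketch of how each of my processes is set up. The hard-coded gpu_id and the log_device_placement flag are only for illustration (log_device_placement is just a way to print which device each op actually gets assigned to); my real scripts only set the two environment variables shown above before importing TensorFlow.

import os

# These must be set before the first import of TensorFlow.
gpu_id = 2  # illustration only; my three processes use 2, 3 and 4
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

import tensorflow as tf

# log_device_placement prints the device each op is placed on, which
# should make it possible to see where the kernels actually run.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    print(sess.run(a + b))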
It seems that TensorFlow, for some reason, is partially ignoring the CUDA_VISIBLE_DEVICES setting.
I am not using any explicit device placement commands anywhere in the code (nothing like the illustration below).
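To be clear about what I mean by "device placement", explicit placement would look something like this sketch; there is nothing of this kind in my code:

import tensorflow as tf

# Illustration only -- my code contains no explicit placement like this.
with tf.device("/gpu:0"):
    x = tf.constant([1.0, 2.0], name="x")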
I experienced this with TensorFlow 1.1 running on Ubuntu 16.04, and it has happened to me across a range of different scenarios.
Is there some known scenario in which this could happen? If so, is there anything I can do about it?