
I saw this solution, but it doesn't quite answer my question; it's also quite old so I'm not sure how relevant it is.

I keep getting conflicting outputs for the order of GPU devices. There are two of them: a Tesla K40m and an NVS 315 (a legacy device that is never used). When I run deviceQuery, I get

Device 0: "Tesla K40m"
...
Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0

Device 1: "NVS 315"
...
Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0

On the other hand, nvidia-smi lists them in a different order:

 0  NVS 315
 1  Tesla K40m

I find this very confusing. The solution I found for TensorFlow (and a similar one for PyTorch) is to use

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  
os.environ["CUDA_VISIBLE_DEVICES"]="0"

The PCI bus ID is 4 for the Tesla and 3 for the NVS, so with PCI_BUS_ID ordering, setting "0" should select the device on bus 3 (the NVS), is that right?
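For context, the two variables can be sketched together like this (a minimal sketch; the key assumption, worth stating explicitly, is that the CUDA runtime reads them only once, at initialization, so they must be set before importing torch/tensorflow triggers the first CUDA call):

```python
import os

# Must run before CUDA is initialized (i.e. before the first CUDA call
# made by torch/tensorflow); setting these afterwards has no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # enumerate by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # device with the lowest bus ID

print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```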

In PyTorch I set

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))

and get Tesla K40m.

When I instead set

os.environ['CUDA_VISIBLE_DEVICES'] = '1'
device = torch.cuda.device(1)
print(torch.cuda.get_device_name(0))

to get

UserWarning: 
    Found GPU0 NVS 315 which is of cuda capability 2.1.
    PyTorch no longer supports this GPU because it is too old.

  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
NVS 315

So I'm quite confused: what's the true order of GPU devices that tf and pytorch use?

Robert Crovella
Alex
  • If you are going to use PyTorch or TensorFlow, the order you need is the one that `nvidia-smi` shows, because the NVIDIA drivers are what actually run the deep neural networks on the GPUs. – David Jimenez Oct 15 '18 at 11:34
  • I edited the question; the second case should be os.environ['CUDA_VISIBLE_DEVICES']='1'. So this is the order by default: Tesla=0, NVS=1. Only when I set os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' does the order reverse. – Alex Oct 15 '18 at 11:45

1 Answer


By default, CUDA orders the GPUs by compute power (CUDA_DEVICE_ORDER=FASTEST_FIRST): GPU:0 will be the fastest GPU on your host, in your case the K40m.

If you set CUDA_DEVICE_ORDER='PCI_BUS_ID', then CUDA orders your GPUs by PCI bus ID, which depends on how your machine is wired: GPU:0 will be the device with the lowest bus ID.

Both Tensorflow and PyTorch use the CUDA GPU order. That is consistent with what you showed:

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))

Default (fastest-first) order, so GPU:0 is the K40m, since it is the most powerful card on your host.

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
...
device = torch.cuda.device(0)
print(torch.cuda.get_device_name(0))

PCI bus ID order, so GPU:0 is the card with the lowest bus ID, in your case the NVS 315.
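The renumbering that confused you can be illustrated with a small sketch (pure Python, no GPU needed; the assumed host order below is the default fastest-first enumeration on your machine): whatever survives the CUDA_VISIBLE_DEVICES mask is renumbered starting from 0 inside the process, which is why get_device_name(0) returns a different card for each mask.

```python
# Assumed host order under the default FASTEST_FIRST enumeration:
# index 0 = Tesla K40m, index 1 = NVS 315 (from the question).
HOST_GPUS = ["Tesla K40m", "NVS 315"]

def visible_devices(mask: str) -> list:
    """Devices a process sees for a given CUDA_VISIBLE_DEVICES value.

    Surviving devices are renumbered from 0 in mask order, mimicking
    how the CUDA runtime applies the mask.
    """
    return [HOST_GPUS[int(i)] for i in mask.split(",") if i.strip()]

print(visible_devices("0"))  # ['Tesla K40m'] -> in-process index 0 is the K40m
print(visible_devices("1"))  # ['NVS 315']    -> in-process index 0 is the NVS
```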

Olivier Dehaene