
Is it still possible to run training in some kind of multi-GPU setting if I get "Peer access not supported between device ordinals"? (As I understand it, this means the GPUs are not connected to each other.) For example, by calculating each batch separately on each GPU and then merging the gradients on the CPU; as I understand it, this is how 'batch accumulation' works in DIGITS with the Caffe backend.
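
To make the idea concrete, here is roughly the setup I have in mind: a minimal sketch in the TF 1.x API of that time, where the model, input shapes, and learning rate are just placeholders of mine, not anything taken from DIGITS:

import tensorflow as tf

NUM_GPUS = 4

def variables_on_cpu(gpu_device):
    # Device function: pin variables to the CPU so every tower reads and
    # writes them through host memory; all other ops go to the given GPU.
    def setter(op):
        return "/cpu:0" if op.type in ("Variable", "VariableV2") else gpu_device
    return setter

x = tf.placeholder(tf.float32, [None, 784])  # hypothetical input shape
y = tf.placeholder(tf.int64, [None])
x_split = tf.split(x, NUM_GPUS)  # batch size assumed divisible by NUM_GPUS
y_split = tf.split(y, NUM_GPUS)

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device(variables_on_cpu("/gpu:%d" % i)):
        # reuse=(i > 0) shares one set of weights across all towers.
        with tf.variable_scope("model", reuse=(i > 0)):
            logits = tf.layers.dense(x_split[i], 10)  # stand-in model
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=y_split[i], logits=logits))
        tower_grads.append(opt.compute_gradients(loss))

# Merge on the CPU: average each variable's gradient across the towers
# and apply a single update. No GPU-to-GPU copies are required.
with tf.device("/cpu:0"):
    avg_grads = []
    for grads_and_vars in zip(*tower_grads):
        grad = tf.add_n([g for g, _ in grads_and_vars]) / NUM_GPUS
        avg_grads.append((grad, grads_and_vars[0][1]))
    train_op = opt.apply_gradients(avg_grads)
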

Raw output:

2017-05-10 15:27:54.360688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 1
2017-05-10 15:27:54.360949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 2
2017-05-10 15:27:54.361504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 3
2017-05-10 15:27:54.361738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 0
2017-05-10 15:27:54.361892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 2
2017-05-10 15:27:54.362065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 3
2017-05-10 15:27:54.362263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 0
2017-05-10 15:27:54.362485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 1
2017-05-10 15:27:54.362693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 3
2017-05-10 15:27:54.362885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 0
2017-05-10 15:27:54.362927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 1
2017-05-10 15:27:54.362967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 2
2017-05-10 15:27:54.364638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3 
2017-05-10 15:27:54.364668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y N N N 
2017-05-10 15:27:54.364687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   N Y N N 
2017-05-10 15:27:54.364702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   N N Y N 
2017-05-10 15:27:54.364717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   N N N Y 
mrgloom

1 Answer


This message is benign (it is an "INFO" message, not an error). Everything in TensorFlow will still work, but perhaps more slowly than it would on hardware that does support peer-to-peer access.

The message means the NVIDIA driver is reporting that peer-to-peer access is not possible between your GPUs. See: https://developer.nvidia.com/gpudirect for more information.
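
If you want to confirm what the driver reports outside of TensorFlow, you can query the CUDA runtime directly. A minimal ctypes sketch; the library name is an assumption and differs by platform:

import ctypes

cudart = ctypes.CDLL("libcudart.so")  # e.g. "cudart64_80.dll" on Windows

count = ctypes.c_int()
assert cudart.cudaGetDeviceCount(ctypes.byref(count)) == 0

for i in range(count.value):
    for j in range(count.value):
        if i == j:
            continue
        can = ctypes.c_int()
        # int cudaDeviceCanAccessPeer(int* canAccessPeer, int device, int peerDevice)
        assert cudart.cudaDeviceCanAccessPeer(ctypes.byref(can), i, j) == 0
        print("GPU %d -> GPU %d peer access: %s" % (i, j, bool(can.value)))
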

You can use the command

nvidia-smi topo -m

to display the bus topology.

Peter Hawkins
  • On Windows, `nvidia-smi topo -m` gives "Invalid combination of input arguments. Please run 'nvidia-smi -h' for help" – empty Oct 05 '17 at 21:17
  • @empty and what does `nvidia-smi -h` say? Some programs on Windows take arguments with slashes rather than dashes; maybe that's the case here too? – Ciprian Tomoiagă Oct 09 '17 at 14:43
  • @CiprianTomoiaga nvidia-smi -h gives "NVIDIA System Management Interface -- v385.54" plus the list of options and flags, none of which is 'topo'. The list of options is: dmon, daemon, replay, pmon, nvlink, clocks, encodersessions – empty Oct 23 '17 at 17:48
  • @empty how did you solve the issue? Did you find another way to get the topology? – jimifiki Nov 21 '17 at 14:37
  • @jimifiki nope. – empty Nov 21 '17 at 16:57