
I am interested in the RDMA support in TensorFlow 1.15 for workers and parameter servers to communicate directly without going through the CPU. I do not have InfiniBand VERBS devices, but I can build TensorFlow from source with VERBS support:

bazel build --config=opt --config=cuda --config=verbs //tensorflow/tools/pip_package:build_pip_package

after running sudo yum install libibverbs-devel on CentOS 7. However, after pip-installing the built package via

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && pip install /tmp/tensorflow_pkg/tensorflow-1.15.0-cp36-cp36m-linux_x86_64.whl

my training failed with the following error:

F tensorflow/contrib/verbs/rdma.cc:127] Check failed: dev_list No InfiniBand device found

This is expected since I do not have InfiniBand hardware on my machine. But do I really need InfiniBand if my job runs on a single machine rather than across machines? I just want to test whether RDMA can significantly speed up parameter-server-based training. Thanks.
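For reference, the contrib/verbs transport in TF 1.x is selected per tf.train.Server via protocol="grpc+verbs", which appears to be the path that triggers the rdma.cc device check above. Below is a minimal sketch of how it would be wired into a parameter-server job, assuming a hypothetical two-process setup on one host with arbitrarily chosen ports:

```python
# Minimal sketch (assumptions: one host, ports 2222/2223 chosen arbitrarily,
# each server normally runs in its own process).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# In the parameter-server process:
#   tf.train.Server(cluster, job_name="ps", task_index=0,
#                   protocol="grpc+verbs").join()

# In the worker process:
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.Variable(tf.zeros([10]))         # variable placed on the PS task
    update = w.assign_add(tf.ones([10]))    # update computed on the worker

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
```

Even on a single host, creating servers with protocol="grpc+verbs" goes through contrib/verbs, so the InfiniBand device check runs regardless of whether the processes are co-located.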

John Jiang
  • This isn't RDMA. If you are transferring data from one GPU's memory to another GPU's memory in the same machine, GPUDirect RDMA is not involved. Neither is InfiniBand. – Robert Crovella Jul 20 '20 at 18:49
  • @RobertCrovella Do you know of any framework that supports arbitrary GPU-to-GPU memory transfers, more flexible than what NCCL offers? – John Jiang Jul 20 '20 at 18:51
  • NCCL is a collectives library. If you are doing collective operations, it should be pretty good and is included in most DL frameworks. For arbitrary communications, I would imagine you would just use GPUDirect P2P directly, or else CUDA IPC. I don't have specific recipes or instructions. – Robert Crovella Jul 20 '20 at 18:54

1 Answer


But do I really need InfiniBand if my job runs on a single machine rather than across machines?

No, and it seems you misunderstand what RDMA actually is. RDMA ("GPUDirect RDMA") is a way for third-party devices such as network interfaces and storage adapters to write directly to a GPU's memory across the PCI bus. It is intended to improve multi-node performance in things like compute clusters. It has nothing to do with multi-GPU operations within a single node ("peer-to-peer"), where GPUs attached to the same node can directly access one another's memory without trips through the host CPU.
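As a quick single-node illustration (a sketch assuming at least two visible GPUs), TensorFlow handles the GPU 0 → GPU 1 transfer below with an ordinary device-to-device copy, using peer-to-peer DMA when the topology supports it, with no VERBS or InfiniBand in the path:

```python
# Minimal single-node sketch (assumption: at least two visible GPUs).
# The copy of `b` from GPU 0 to GPU 1 goes over PCIe/NVLink, peer-to-peer
# when the hardware allows it -- no InfiniBand or grpc+verbs involved.
import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random.normal([1024, 1024])
    b = tf.matmul(a, a)

with tf.device("/gpu:1"):
    c = tf.matmul(b, b)  # TensorFlow inserts the GPU0 -> GPU1 copy here

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)
```

log_device_placement=True simply makes the per-GPU op placement visible in the session log.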

talonmies