
I am interested in the RDMA support in TensorFlow 1.15 for workers and parameter servers to communicate directly without going through the CPU. I do not have InfiniBand VERBS devices, but I can build TensorFlow from source with VERBS support:

bazel build --config=opt --config=cuda --config=verbs //tensorflow/tools/pip_package:build_pip_package

after running sudo yum install libibverbs-devel on CentOS 7. However, after pip-installing the built package via

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && pip install /tmp/tensorflow_pkg/tensorflow-1.15.0-cp36-cp36m-linux_x86_64.whl

my training failed with the following error:

F tensorflow/contrib/verbs/rdma.cc:127] Check failed: dev_list No InfiniBand device found

This is expected since I do not have InfiniBand hardware on my machine. But do I really need InfiniBand if my job runs on a single machine rather than across machines? I just want to test whether RDMA can significantly speed up parameter-server-based training. Thanks.
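For reference, the contrib/verbs transport in TF 1.x is selected per tf.train.Server via protocol="grpc+verbs", which appears to be the path that triggers the rdma.cc device check above. Below is a minimal sketch of how it would be wired into a parameter-server job, assuming a hypothetical two-process setup on one host with arbitrarily chosen ports:

```python
# Minimal sketch (assumptions: one host, ports 2222/2223 chosen arbitrarily,
# each server normally runs in its own process).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# In the parameter-server process:
#   tf.train.Server(cluster, job_name="ps", task_index=0,
#                   protocol="grpc+verbs").join()

# In the worker process:
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.Variable(tf.zeros([10]))         # variable placed on the PS task
    update = w.assign_add(tf.ones([10]))    # update computed on the worker

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
```

Even on a single host, creating servers with protocol="grpc+verbs" goes through contrib/verbs, so the InfiniBand device check runs regardless of whether the processes are co-located.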

John Jiang
  • This isn't RDMA. If you are transferring data from one GPU's memory to another GPU's memory in the same machine, GPUDirect RDMA is not involved. Neither is InfiniBand. – Robert Crovella Jul 20 '20 at 18:49
  • @RobertCrovella Do you know of any framework that supports arbitrary GPU-to-GPU memory transfers, more flexible than what NCCL offers? – John Jiang Jul 20 '20 at 18:51
  • NCCL is a collectives library. If you are doing collective operations, it should be pretty good and is included in most DL frameworks. For arbitrary communications, I would imagine you would just use GPUDirect P2P directly, or else CUDA IPC. I don't have specific recipes or instructions. – Robert Crovella Jul 20 '20 at 18:54

1 Answer


But do I really need InfiniBand if my job runs on a single machine rather than across machines?

No, and it seems you misunderstand what RDMA actually is. RDMA ("GPUDirect RDMA") is a way for third-party devices such as network interfaces and storage adapters to write directly to a GPU's memory across the PCI bus. It is intended to improve multi-node performance in things like compute clusters. It has nothing to do with multi-GPU operations within a single node ("peer-to-peer"), where GPUs attached to the same node can directly access one another's memory without trips through the host CPU.
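As a quick single-node illustration (a sketch assuming at least two visible GPUs), TensorFlow handles the GPU 0 → GPU 1 transfer below with an ordinary device-to-device copy, using peer-to-peer DMA when the topology supports it, with no VERBS or InfiniBand in the path:

```python
# Minimal single-node sketch (assumption: at least two visible GPUs).
# The copy of `b` from GPU 0 to GPU 1 goes over PCIe/NVLink, peer-to-peer
# when the hardware allows it -- no InfiniBand or grpc+verbs involved.
import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random.normal([1024, 1024])
    b = tf.matmul(a, a)

with tf.device("/gpu:1"):
    c = tf.matmul(b, b)  # TensorFlow inserts the GPU0 -> GPU1 copy here

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)
```

log_device_placement=True simply makes the per-GPU op placement visible in the session log.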

talonmies