I am interested in the RDMA support in TensorFlow 1.15, which lets workers and parameter servers communicate directly without going through the CPU. I do not have InfiniBand (VERBS) devices, but I can build TensorFlow from source with VERBS support using
bazel build --config=opt --config=cuda --config=verbs //tensorflow/tools/pip_package:build_pip_package
after running sudo yum install libibverbs-devel
on CentOS 7. However, after packaging and pip-installing the build via
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && pip install /tmp/tensorflow_pkg/tensorflow-1.15.0-cp36-cp36m-linux_x86_64.whl
my training failed with the following error:
F tensorflow/contrib/verbs/rdma.cc:127] Check failed: dev_list No InfiniBand device found
This is expected, since I do not have InfiniBand hardware on my machine. But do I really need InfiniBand if my job runs on a single machine rather than across machines? I just want to test whether RDMA can significantly speed up parameter-server-based training. Thanks.
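
For context, here is a minimal sketch of how the fatal check could be avoided by falling back to plain gRPC when no device is present. It assumes that the RDMA devices the verbs library would find are also listed under /sys/class/infiniband, which is usually the case on Linux when the kernel drivers are loaded but is not something I have verified against rdma.cc. The helper names list_infiniband_devices and choose_protocol are hypothetical and not part of TensorFlow.

```python
import os


def list_infiniband_devices(sysfs_root="/sys/class/infiniband"):
    """Return the RDMA device names the kernel exposes, or [] if none.

    Assumption: the sysfs listing reflects the devices the verbs
    library would report on this machine.
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        # The directory is missing when no RDMA driver is loaded.
        return []


def choose_protocol(sysfs_root="/sys/class/infiniband"):
    """Pick "grpc+verbs" only when at least one device is visible."""
    if list_infiniband_devices(sysfs_root):
        return "grpc+verbs"
    return "grpc"


if __name__ == "__main__":
    # Prints the protocol string that would be selected on this machine.
    print(choose_protocol())
```

The returned string would then be passed as the protocol argument of tf.train.Server, so that the verbs code path is only entered on machines that actually expose a device.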