Questions tagged [infiniband]

InfiniBand is a high-speed switched fabric communications link technology used in high-performance computing and enterprise data centers.

InfiniBand is a switched fabric communications link used in high-performance computing and enterprise data centers. Its features include scalability, high throughput, low latency, quality of service and failover. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices. InfiniBand host bus adapters and network switches are commonly manufactured by Mellanox and Intel.

178 questions
0 votes, 1 answer

OpenMPI 4.1.1: "There was an error initializing an OpenFabrics device" with InfiniBand Mellanox MT28908

Similar to the discussion at MPI hello_world to test infiniband, we are using OpenMPI 4.1.1 on RHEL 8 with a 5e:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b], and we see this warning with mpirun: WARNING:…
RobbieTheK
0 votes, 1 answer

RDMA/InfiniBand cannot open hosts (iberror: discovery failed), port state: Down

I am facing an issue while configuring RDMA and InfiniBand on my two nodes. The two nodes are connected and I have installed the recommended software libraries and packages, but my port state is Down and the physical state is…
DumbLoawai
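A quick first check in this situation is what the adapter itself reports for the port. The sketch below is a generic libibverbs diagnostic, not tied to this poster's setup; it assumes port 1 and a working rdma-core installation (link with -libverbs).

    /* Print the logical and physical state of port 1 on every verbs device. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);
        if (!list) { perror("ibv_get_device_list"); return 1; }

        for (int i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(list[i]);
            if (!ctx)
                continue;

            struct ibv_port_attr attr;
            if (ibv_query_port(ctx, 1, &attr) == 0)
                printf("%s port 1: state=%s, phys_state=%d\n",
                       ibv_get_device_name(list[i]),
                       ibv_port_state_str(attr.state), attr.phys_state);
            ibv_close_device(ctx);
        }
        ibv_free_device_list(list);
        return 0;
    }

As a rule of thumb, a port that stays Down with a physical state of Polling usually points at the link itself (cable, transceiver, remote port), while a port stuck in INIT usually means no subnet manager is running on the fabric.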
0 votes, 1 answer

How does SEND bandwidth improve when the registered memory is aligned to the system page size? (In Mellanox IBD)

Operating system: RHEL/CentOS 7.9 (latest). Operation: sending 500 MB chunks 21 times from one system to another, connected via Mellanox cables (Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]). (The registered memory region…
Vaishakh
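For reference, page alignment of the registered buffer is easy to arrange before ibv_reg_mr(); the sketch below shows only the allocation pattern and says nothing about why the measured SEND bandwidth differs. The function name and access flags are illustrative, and pd is assumed to be an already-allocated protection domain.

    /* Page-aligned allocation for an RDMA memory region (link with -libverbs).
     * Aligning to the page size keeps the MR from straddling partial pages. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_aligned(struct ibv_pd *pd, size_t len)
    {
        void *buf = NULL;
        long page = sysconf(_SC_PAGESIZE);

        if (posix_memalign(&buf, (size_t)page, len) != 0)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            free(buf);
        return mr;
    }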
0 votes, 0 answers

ibv_post_send performance optimization in a KMDF Windows driver

I want to use RDMA in a KMDF driver where the buffer received in EvtIoWrite will be written directly to remote memory. This operation is performance-critical, so I'm wondering if there is a way to avoid copying the buffer to an RDMA memory…
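A Windows KMDF driver would go through the kernel NDK RDMA interface rather than libibverbs, but the usual zero-copy idea is the same: register the caller's existing buffer in place and post it, instead of copying it into a pre-registered staging region. A user-space verbs sketch of that pattern follows; the function name is illustrative, and qp and pd are assumed to be already connected/allocated.

    /* Zero-copy send: register the caller's buffer in place and post it
     * directly instead of memcpy()ing into a staging MR. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int send_in_place(struct ibv_qp *qp, struct ibv_pd *pd, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uintptr_t)mr,   /* stash the MR to deregister on completion */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad = NULL;

        /* The buffer and MR must stay valid until the completion arrives. */
        return ibv_post_send(qp, &wr, &bad);
    }

Note that per-operation registration is itself expensive, so production code typically keeps a registration cache or pins the I/O buffers up front rather than registering on every request.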
0 votes, 1 answer

What to change in ibverbs when switching from UD to RC connections

I'm looking at ibverbs code from Mellanox that does a send/recv operation via ibverbs. The code uses UD connections, but it didn't work when I changed qp_type = IBV_QPT_UD to IBV_QPT_RC. What do I need to change in this case other than the…
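Changing qp_type is only part of it: an RC QP's INIT transition sets access flags rather than a Q_Key, the send WRs drop the wr.ud address-handle/remote-QPN fields used by UD, and the RTR/RTS transitions need the peer's QPN, PSN and LID plus timeout/retry settings. A minimal sketch of the RC transitions, with illustrative MTU/timer values and peer parameters assumed to have been exchanged out of band:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Bring an RC QP (already in INIT) to RTR and then RTS. */
    int rc_connect(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t remote_lid,
                   uint32_t remote_psn, uint32_t local_psn, uint8_t port)
    {
        struct ibv_qp_attr rtr = {
            .qp_state           = IBV_QPS_RTR,
            .path_mtu           = IBV_MTU_1024,
            .dest_qp_num        = remote_qpn,   /* not used by UD */
            .rq_psn             = remote_psn,
            .max_dest_rd_atomic = 1,
            .min_rnr_timer      = 12,
            .ah_attr = { .dlid = remote_lid, .sl = 0,
                         .src_path_bits = 0, .port_num = port },
        };
        if (ibv_modify_qp(qp, &rtr,
                          IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                          IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                          IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
            return -1;

        struct ibv_qp_attr rts = {
            .qp_state      = IBV_QPS_RTS,
            .timeout       = 14,
            .retry_cnt     = 7,
            .rnr_retry     = 7,
            .sq_psn        = local_psn,
            .max_rd_atomic = 1,
        };
        return ibv_modify_qp(qp, &rts,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }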
0 votes, 1 answer

RoCE connection problem with MLNX_OFED (RDMA over Converged Ethernet)

I am trying to get RoCE (RDMA over Converged Ethernet) to work on two workstations. I have installed MLNX_OFED on both computers, which are equipped with Mellanox ConnectX-5 EN 100GbE adapters and connected directly to each other via corresponding…
Fiskrens
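One generic thing worth checking on a RoCE setup (a diagnostic sketch, not specific to this poster's configuration; it assumes the first device and port 1): the port's GID table. RoCE addressing is derived from the Ethernet interface's MAC/IP configuration, so an empty or unexpected table usually points at the netdev side rather than the adapter.

    /* Dump the GID table of port 1 on the first verbs device (link with -libverbs). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **list = ibv_get_device_list(NULL);
        if (!list || !list[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(list[0]);
        struct ibv_port_attr port;
        if (!ctx || ibv_query_port(ctx, 1, &port))
            return 1;

        for (int i = 0; i < port.gid_tbl_len; i++) {
            union ibv_gid gid;
            if (ibv_query_gid(ctx, 1, i, &gid))
                continue;
            printf("gid[%d]:", i);
            for (int b = 0; b < 16; b++)
                printf("%s%02x", b ? ":" : " ", gid.raw[b]);
            printf("\n");
        }
        ibv_close_device(ctx);
        ibv_free_device_list(list);
        return 0;
    }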
0 votes, 1 answer

Using TensorFlow with VERBS support without an InfiniBand device

I am interested in the RDMA support in TensorFlow 1.15, which lets workers and parameter servers communicate directly without going through the CPU. I do not have InfiniBand VERBS devices, but I can build TensorFlow from source with VERBS support: bazel build…
John Jiang
0 votes, 0 answers

How to force Open MPI 3 to use TCP

I run a small cluster for MPI computing, and recently we acquired some EDR InfiniBand equipment. I am testing it with two computers connected through an unmanaged switch, and I am able to run a test program with 30 processes across both nodes.…
xaviote
0 votes, 2 answers

How to test RDMA code without actual hardware?

I have C++ code which makes use of InfiniBand verbs for RDMA communication. I need to unit test this code, so the RDMA-related function calls such as ibv_get_device_list() need to succeed without any actual hardware. From my understanding,…
xyz123
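For context, the usual hardware-free options are the software RDMA providers in the mainline kernel, Soft-RoCE (rdma_rxe) and Soft-iWARP (siw), which attach a verbs device to an ordinary NIC so the real ibv_* entry points succeed, or else stubbing/mocking the libibverbs calls in the unit tests. A minimal smoke test against such a software device (assumes rdma-core is installed and an rxe/siw device has been created):

    /* Smoke test: with Soft-RoCE (rdma_rxe) or Soft-iWARP (siw) bound to an
     * ordinary NIC, ibv_get_device_list() returns a software verbs device. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list || num == 0) {
            fprintf(stderr, "no verbs devices found (is rdma_rxe or siw loaded?)\n");
            return 1;
        }
        for (int i = 0; i < num; i++)
            printf("device %d: %s\n", i, ibv_get_device_name(list[i]));

        ibv_free_device_list(list);
        return 0;
    }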
0 votes, 2 answers

MSN (message sequence number) in response for a retransmitted RDMA Read

While running the ib_read_bw test with 64K message sizes from a Mellanox CX-4 (request initiator) to another RNIC, retransmissions happen from the Mellanox side from the 5th RDMA READ onwards for 50KB of data (the first 12KB has been ACKed successfully), after…
Anji M
0 votes, 2 answers

Setting the maximum number of outstanding work requests that can be posted to the Send Queue of a Queue Pair in RDMA

I am trying to create a Queue Pair with ibv_create_qp(), and I have to describe the size of the Queue Pair by setting the fields of struct ibv_qp_cap and providing it to the create function. My issue is with the max_send_wr field, which corresponds…
kfertakis
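For reference, max_send_wr is a request that is bounded by the device-wide limit reported by ibv_query_device() (max_qp_wr), and ibv_create_qp() writes the capacities it actually granted back into the init_attr.cap struct. A sketch of that pattern; the function name and the RC/receive-side numbers are illustrative, and pd and cq are assumed to exist already.

    #include <stdio.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    struct ibv_qp *create_qp_sized(struct ibv_pd *pd, struct ibv_cq *cq,
                                   uint32_t want_send_wr)
    {
        struct ibv_device_attr dev_attr;
        if (ibv_query_device(pd->context, &dev_attr))
            return NULL;
        if (want_send_wr > (uint32_t)dev_attr.max_qp_wr)
            want_send_wr = dev_attr.max_qp_wr;   /* stay under the device ceiling */

        struct ibv_qp_init_attr init = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = IBV_QPT_RC,
            .cap = {
                .max_send_wr  = want_send_wr,
                .max_recv_wr  = 128,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &init);
        if (qp)   /* the provider reports what it actually granted */
            printf("granted max_send_wr = %u\n", init.cap.max_send_wr);
        return qp;
    }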
0 votes, 1 answer

What is the difference between OFED, MLNX OFED and the inbox driver

I'm setting up InfiniBand networks, and I do not fully understand the difference between the different software stacks: OFED (https://www.openfabrics.org/ofed-for-linux/), MLNX OFED…
Jounathaen
0 votes, 1 answer

Does gRPC+MPI require RDMA?

TensorFlow allows the options "gRPC", "gRPC+verbs" and "gRPC+mpi" when specifying a communication protocol. The gRPC+verbs documentation clearly states that this protocol is based on RDMA. Meanwhile, in the gRPC+MPI documentation, it…
JRL
0 votes, 1 answer

Will an RDMA-enabled NIC do endian conversion?

Is it possible to get an RDMA adapter (e.g. a Mellanox NIC) to do an endian conversion during data transfer? Specifically, we're doing an RDMA transfer from a big-endian to a little-endian system and vice versa. Once the data lands at the target, then…
B Abali
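As background: RDMA payload data is normally transferred byte-for-byte, so if conversion is needed it is typically done by the application around the transfer rather than by the NIC. A minimal sketch of that host-side conversion for 64-bit words, using glibc's <endian.h> helpers; the function names are illustrative.

    /* Convert a block of 64-bit words to big-endian before posting the send,
     * and back to host order after the receive completes on the other side. */
    #include <endian.h>     /* htobe64 / be64toh (glibc) */
    #include <stddef.h>
    #include <stdint.h>

    void to_wire(uint64_t *words, size_t n)        /* before ibv_post_send() */
    {
        for (size_t i = 0; i < n; i++)
            words[i] = htobe64(words[i]);
    }

    void from_wire(uint64_t *words, size_t n)      /* after the completion */
    {
        for (size_t i = 0; i < n; i++)
            words[i] = be64toh(words[i]);
    }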
0 votes, 0 answers

Error running an MPI job on a cluster

I am running a code which works perfectly on the cluster. As I increase the number of cores to 3844, I get the following error: "too many retries sending message to 0x0040:0x00152080, giving up". Is this error a network problem, or is this related…
JimBamFeng