
I am measuring the performance of InfiniBand using iperf.

It's a one-to-one connection between a server and a client.

I measured the bandwidth while changing the number of threads that issue network I/O.

( The cluster server has:

  • "Mellanox ConnectX-3 FDR VPI IB/E Adapter for System x" and
  • "Infiniband 40 Gb Ethernet / FDR InfiniBand" )
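For reference, a multi-threaded iperf run of this kind looks roughly like the following, with -P setting the number of parallel client streams (exact options may vary):

    # on the server
    iperf -s

    # on the client: -P <n> starts n parallel streams, -t 30 runs for 30 seconds
    iperf -c <server_ip> -P 4 -t 30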

Bandwidth:

 1 thread  : 1.34 GB/sec,
 2 threads : 1.55 GB/sec ~ 1.75 GB/sec,
 4 threads : 2.38 GB/sec,
 8 threads : 2.03 GB/sec,
16 threads : 2.00 GB/sec,
32 threads : 1.83 GB/sec.

As you can see above, the bandwidth increases up to 4 threads and then decreases.
Could you give me some ideas to help me understand what is happening there?

Additionally, what happens when many machines send data to one machine at the same time (contention)?
Can InfiniBand handle that too?

  • Regarding your second question: IB is mostly lossless. If you send more traffic to a host than it can handle, a congestion signal will propagate back to the source and your sending nodes will slow down. – alnet Aug 31 '15 at 14:32

2 Answers


There are a lot of things going on under the covers here, but one of the biggest bottlenecks in InfiniBand is the QP cache in the firmware.

The firmware has a very small QP cache (on the order of 16-32 entries, depending on which adapter you are using). When the number of active QPs exceeds this cache, any benefit of using IB starts to degrade. From what I know, the performance penalty for a cache miss is on the order of milliseconds. Yes, that's right: milliseconds.

There are many other caches involved.

IB has multiple different transports, the two most common being:

 1. RC - Reliable Connected
 2. UD - Unreliable Datagram

Reliable Connected mode is somewhat like TCP in that it requires an explicit connection and is point-to-point between two processes. Each process allocates a QP (Queue Pair), which is similar to a socket in the Ethernet world, but a QP is a much more expensive resource than a socket for many different reasons.

UD: Unreliable Datagram mode is like UDP in that it does not need a connection. A single UD QP can talk to any number of remote UD QPs.
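To make the difference concrete, here is a minimal libibverbs sketch (my own illustration; the helper name, shared completion queue, and queue depths are arbitrary placeholders). The only difference at creation time is the qp_type, but with RC you need one such QP per remote peer, while a single UD QP can reach any number of peers:

    /* Minimal sketch, not production code: creating an RC vs. a UD QP with
     * libibverbs. Error handling, device/port setup and the transition of the
     * QP through INIT/RTR/RTS are omitted. */
    #include <infiniband/verbs.h>

    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                             enum ibv_qp_type type)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = type,           /* IBV_QPT_RC or IBV_QPT_UD */
            .cap = {
                .max_send_wr  = 64,    /* queue depths: every active QP pins HCA resources */
                .max_recv_wr  = 64,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
        };
        return ibv_create_qp(pd, &attr);   /* NULL on failure */
    }

    /* RC: one QP per remote peer process, so N peers cost N QPs on this side.
     * UD: one QP can send to / receive from any number of remote UD QPs. */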

If your data model is one-to-many, i.e. one machine sending to many machines, and you need a reliable connection with huge data sizes, then you are out of luck: IB starts losing some of its effectiveness.

If you have the resources to build a reliability layer on top, then use UD to get scalability.

If your data model is one-to-many, but the many remote processes reside on the same machine, then you can use RDS (Reliable Datagram Sockets), which is a socket interface to InfiniBand that multiplexes many connections over a single RC connection between two machines. (RDS has its own set of weird issues, but it's a start.)
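From user space, RDS is just the ordinary sockets API with a different address family. A rough sketch (assuming a Linux host with the rds and rds_rdma kernel modules loaded; the port number is just an example):

    /* Rough sketch: opening an RDS socket on Linux. The kernel multiplexes all
     * such sockets between a pair of nodes over one underlying connection. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
        if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

        struct sockaddr_in local = { 0 };
        local.sin_family      = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port        = htons(18634);       /* arbitrary example port */
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
            perror("bind");
            return 1;
        }

        /* From here, sendto()/recvfrom() with a peer sockaddr_in behave like a
         * reliable datagram service over the fabric. */
        close(fd);
        return 0;
    }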

There is a third, newer transport called XRC which mitigates some of these scalability issues as well, but it has its own caveats.

  • Thanks for your helpful explanations :D I didn't know that IB had such a small number of queues. (1) Could you point me to some references, if you know any, where I can find useful information on using IB? (2) I am using Intel's MPI Library, which can exploit InfiniBand's RDMA operations too. The data model is intended mostly for one-to-one communication, but in the worst case there can be contention on a node, resulting in one-to-many communication. – syko Nov 23 '15 at 06:35
  • Here is my concern: (1) Is there any benefit from using multiple threads in a one-to-one connection, e.g. for small messages (~8 bytes) or very large messages (4 MB and up)? (2) Let's say that I have two disjoint sets of processes in my IB network. Can their performance affect each other? – syko Nov 23 '15 at 06:35
  • It depends on your use case: whether your application is latency sensitive and whether the ordering of these small and large messages matters. Having small messages queued up behind larger ones on the same connection will hurt latency. It will also hurt your overall bandwidth for larger messages, because the overhead of dispatching the work is roughly the same for different message sizes. So if you can use two connections, one for small messages and the other for large ones, you will get higher bandwidth and lower latency. If the answers were useful I encourage upvoting them. – Loki Nov 23 '15 at 19:49

Since iperf uses TCP, it will not get all of the bandwidth that is possible with native InfiniBand.

How many cores does your CPU have? Once the number of threads exceeds the number of cores, the threads get time slices and run serially on the same cores instead of running in parallel, and they start getting in each other's way.
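A quick way to check how many logical CPUs the iperf host sees (a trivial sketch, equivalent to running the nproc command):

    /* Print how many logical CPUs are online; once iperf threads outnumber
     * these, they time-slice instead of running in parallel. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical CPUs: %ld\n", cores);
        return 0;
    }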