
Recently, I ran watch -n 1 ifconfig on one of our Linux cluster compute nodes while it was running a 48-process MPI job, distributed over several nodes. Oddly, while Ethernet packets seem to be counted correctly (a few kB/s due to the SSH session), the IB adapter appears to stay idle (no change in RX/TX packets and bytes).

MPI over IB is definitely working on our cluster (we did several checks, and in any case people would have noticed if it weren't), and even more strangely, if I ping the InfiniBand HCA from another node, packets suddenly are counted.

Admittedly, my knowledge of IB is quite limited, but I know that one of the key reasons for InfiniBand's improved performance is that it bypasses the (kernel) network stack by implementing it directly in hardware (or so I thought - please correct me if I'm wrong!).

My explanation would be that the kernel isn't able to properly intercept the traffic because the packets never reach it, so the information is missing in the respective layer - does this sound reasonable? However, I'm not sure what is happening in the ICMP case then. Maybe data sent over IPoIB does trigger the kernel's packet-counting routines, while "IB-native" protocols (verbs, RDMA) do not?

Unfortunately, I could not find any information on this matter on the internet.

andreee
  • RX/TX counters are related to the IP stack, so they only account for the traffic through the NIC driver provided by the IPoIB stack. MPI is not using IPoIB but IB directly (at the interface level, if you will). – sfk Mar 31 '17 at 14:14

1 Answer


You are correct in your assumptions. When running MPI over InfiniBand, you normally want to bypass the network stack and use the RDMA/verbs interface to get full performance. All communication sent over this interface will not be accounted on the IPoIB interface (e.g. ib0).
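A quick way to see this for yourself (a rough sketch, assuming your IPoIB interface is called ib0 and the peer's IPoIB address is something like 10.0.0.2 - adjust both to your setup):

ip -s link show ib0     # note the current RX/TX byte counters
ping -c 100 10.0.0.2    # IPoIB traffic goes through the kernel IP stack
ip -s link show ib0     # the counters have increased

Running an MPI job over verbs/RDMA between the same nodes will leave those counters untouched, because that traffic never passes through the IPoIB network device.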

To monitor the traffic the InfiniBand card is actually doing, you can look at the counters in /sys/class/infiniband/mlx4_0/ports/1/counters/. Unfortunately those are only 32-bit counters, which fill up very quickly on InfiniBand, so you should install perfquery, which can collect the performance counters in your fabric as 64-bit counters.
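For a quick look you can read those files directly (a sketch, assuming the device is mlx4_0 on port 1 as above; the exact counter file names may differ between drivers):

cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data

These are the 32-bit counters mentioned above, so they wrap around quickly on a busy link.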

To do a simple query with perfquery locally on a node and get the 64-bit counters, you can issue the following command.

perfquery -x 

You can also get the performance counters of a remote machine by adding the LID of the remote InfiniBand device.

perfquery -x -a 2

Here, -a means to query all ports of LID 2.
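If you do not know the LIDs in your fabric, ibstat (which, if I recall correctly, ships in the same infiniband-diags package as perfquery) prints the state and Base LID of the local ports:

ibstat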

Please note that PortXmitData and PortRcvData are per-lane numbers, so you normally have to multiply them by 4 to get actual bytes. You can also add -r to your perfquery call to reset the counters, which makes it easier to calculate per-second figures.
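Putting that together, a rough per-second throughput estimate could look like the sketch below (this assumes perfquery prints lines of the form "PortXmitData:....<value>" - check the output format on your system):

perfquery -x -r > /dev/null   # read and reset the extended counters
sleep 10
xmit=$(perfquery -x | awk -F: '/PortXmitData/ {gsub(/[^0-9]/, "", $2); print $2}')
echo "TX: $(( xmit * 4 / 10 )) bytes/s"   # x4 for the per-lane factor, /10 for the interval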

Thomas