
I'm new to HPC and curious about the performance of MPI over InfiniBand. For reference, I am using Open MPI between two machines connected through IB.

I've coded a very simple benchmark to see how fast I can transfer data over IB using MPI calls. Below you can see the code.

The issue is that when I run this, I get a throughput of ~1.4 GB/s. However, when I use standard IB benchmarks like ib_write_bw, I get nearly 6 GB/s. What might account for this sizable discrepancy? Am I being naive about MPI_Gather, or is this just Open MPI overhead that I can't overcome?

In addition to the code, I am providing a plot to show the results of my simple benchmark.

Thanks in advance!

Results: [plot of the benchmark results omitted]

Code:

#include <iostream>
#include <mpi.h>
#include <cstdint>
#include <ctime>
using namespace std;

// Root rank: gathers `size` bytes from each of the `n` ranks and times the call.
void server(unsigned int size, unsigned int n) {
    uint8_t* recv = new uint8_t[(size_t)size * n];  // cast avoids 32-bit overflow for large sizes
    uint8_t* send = new uint8_t[size];
    std::clock_t s = std::clock();
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    std::clock_t e = std::clock();
    cout << size << " " << (e - s) / double(CLOCKS_PER_SEC) << endl;
    delete [] recv;
    delete [] send;
}

// Non-root ranks: contribute `size` bytes; the recv arguments are ignored on non-root ranks.
void client(unsigned int size, unsigned int n) {
    (void)n;  // unused on non-root ranks
    uint8_t* send = new uint8_t[size];
    MPI_Gather(send, size, MPI_CHAR, NULL, 0, MPI_CHAR, 0, MPI_COMM_WORLD);
    delete [] send;
}

int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cout << "Rank " << rank << " of " << size << endl;
    // Sweep message sizes from ~2 MB up to ~2.1 GB in ~1000 steps.
    // 1u << 31 avoids the signed-overflow UB of 1 << 31.
    unsigned int min = 1, max = (1u << 31), n = 1000;
    for (unsigned int i = 1; i < n; i++) {
        unsigned int s = i * ((max - min) / n);
        if (rank == 0) server(s, size); else client(s, size);
    }

    MPI_Finalize();
}
Joshua Gevirtz
  • Have you tried using another established benchmark to verify that these performance issues are not related to the code? If you have access to the Intel Compilers and Intel MPI, the Intel MPI Benchmarks are a great tool for verifying performance issues: https://software.intel.com/en-us/articles/intel-mpi-benchmarks. Did you vary the load size in your benchmark or use a single size? What about warm-up? How many times did you run it? IMB-PingPong runs a variety of load sizes over many iterations, after warming up, to test bandwidth and throughput. – Matt Oct 15 '16 at 03:32

2 Answers


In your code you are executing a single collective operation per message size. That carries a huge overhead compared with tools written specifically for performance measurement (e.g. ib_write_bw). More generally, comparing MPI collectives to ib_write_bw is not an apples-to-apples comparison:

  • RDMA opcode

    • ib_write_bw uses RDMA_WRITE operations, which don't involve the CPU at all - once the initial handshake is done, it is pure RDMA, constrained only by network and PCIe capabilities.
    • MPI uses different RDMA opcodes (and protocols) for different collectives and different message sizes, and when you issue a single collective per size, as in your code, MPI does a lot of work for every message (hence the huge overhead)
  • Data overhead

    • ib_write_bw transfers almost pure data (there's a local routing header and a payload)
    • MPI has more data (headers) added to each packet to allow the receiver to identify the message
  • Zero copy

    • ib_write_bw does what is called "zero-copy" - data is sent directly from a user buffer and written directly into a user buffer on the receiving side, without any intermediate copies
    • MPI will copy the message from your client's buffer into its internal buffers on the sender side, then copy it again from its internal buffers into your server's buffer on the receiving side. Again, this behaviour depends on the message size, the MPI configuration and the MPI implementation, but you get the general idea.
  • Memory registration

    • ib_write_bw registers the required memory region and exchanges this info between client and server before it starts measuring performance (see the sketch after this list)
    • If MPI needs to register a memory region during the collective, it does so while you are measuring time
  • There are many more differences

    • even the "small" things like warming up the cache lines on the HCAs...
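For illustration, here is a minimal sketch of what that up-front registration step looks like at the verbs level. It is not how ib_write_bw is literally written: error handling, queue-pair setup and the address/rkey exchange with the peer are all omitted, and only the calls themselves are real verbs API.

// Sketch of verbs-level memory registration, the step ib_write_bw pays once,
// up front, before its timed loop. Compile with -libverbs; error checks omitted.
#include <infiniband/verbs.h>
#include <cstddef>

int main() {
    int num_devices = 0;
    ibv_device **devs = ibv_get_device_list(&num_devices);  // enumerate HCAs
    ibv_context *ctx  = ibv_open_device(devs[0]);            // open the first one
    ibv_pd *pd        = ibv_alloc_pd(ctx);                    // protection domain

    const size_t len = 1 << 20;                               // 1 MiB test buffer
    char *buf = new char[len];

    // Pin the buffer and make it accessible to the HCA; this is the expensive
    // part that happens before any measurement starts.
    ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

    // ... exchange mr->rkey and the buffer address with the peer, set up the
    // queue pairs, and only then start issuing RDMA writes ...

    ibv_dereg_mr(mr);
    delete [] buf;
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}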

So, now that we've covered why you shouldn't compare these things, here's what you should do:

There are two libraries that are regarded as a de-facto standard for MPI performance measurement:

  1. IMB (Intel MPI Benchmarks) - it says Intel, but it is written as a standard MPI application and will work with any MPI implementation.
  2. OSU Micro-Benchmarks - again, they come from the MVAPICH team, but they will work with any MPI.

Download those, compile them with your MPI, run the benchmarks, and see what you get. That is as high as you can get with MPI. If you get much better results than with your small program (and you will, for sure) - these benchmarks are open source, so go see how the pros are doing it :)
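In the meantime, if you want your own micro-benchmark to get closer to what those suites do, the usual recipe is: a few warm-up iterations, a barrier, then the average over many timed iterations using MPI_Wtime (wall-clock) rather than std::clock (CPU time). A rough sketch of such a helper, which could replace the server/client pair in your code (the name time_gather and the iteration counts are arbitrary choices, not taken from either suite):

// Sketch: timing MPI_Gather roughly the way IMB/OSU-style benchmarks do it --
// warm-up iterations, a barrier, then the average over many repetitions.
#include <mpi.h>
#include <cstdint>
#include <cstddef>

double time_gather(int size, int nranks, int rank,
                   int warmup = 10, int iters = 100) {
    uint8_t* sendbuf = new uint8_t[size];
    uint8_t* recvbuf = (rank == 0) ? new uint8_t[(std::size_t)size * nranks] : NULL;

    // Warm-up: pays for connection setup, memory registration, cache warming.
    for (int i = 0; i < warmup; i++)
        MPI_Gather(sendbuf, size, MPI_CHAR,
                   recvbuf, size, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);               // start all ranks together
    double t0 = MPI_Wtime();                   // wall-clock time, not CPU time
    for (int i = 0; i < iters; i++)
        MPI_Gather(sendbuf, size, MPI_CHAR,
                   recvbuf, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    double avg = (MPI_Wtime() - t0) / iters;   // average seconds per gather

    delete [] sendbuf;
    delete [] recvbuf;                         // deleting NULL is a no-op
    return avg;
}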

Have fun!

kliteyn

You also have to consider that the total payload the collective delivers to rank 0 depends on the number of ranks: with, say, 4 processes each sending 1000 bytes, the root rank actually receives 4000 bytes. That includes a memory copy of rank 0's own contribution from its input buffer into the output buffer (possibly with a detour through the network stack), and all of that is before you add the overheads of MPI and the lower networking layers.
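As a small illustration of that accounting (the helper name is made up): with N ranks each contributing `size` bytes, the root gathers size*N bytes in total, but only size*(N-1) of them actually cross the InfiniBand link; the root's own block is a local copy.

// Illustrative only: bandwidth based on the bytes that cross the wire,
// not on the total volume gathered on the root.
#include <cstddef>

double wire_bandwidth_gbs(std::size_t size, int nranks, double seconds) {
    std::size_t bytes_over_wire = size * (std::size_t)(nranks - 1);
    return bytes_over_wire / seconds / 1e9;   // GB/s actually sent over IB
}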

dabo42