
I am new to High-Performance Computing (HPC), but I am about to start an HPC project, so I need some help with a few fundamental questions.

The application scenario is simple: several servers connected by an InfiniBand (IB) network, one acting as the master and the others as slaves. Only the master reads/writes in-memory data (ranging from 1 KB to several hundred MB) to/from the slaves, while the slaves just passively store the data in their memory (and dump it to disk at the right time). All computation is performed on the master, before writing the data to the slaves or after reading it back. The system requires low latency for small regions of data (1 KB–16 KB) and high throughput for large regions (several hundred MB).

So, my questions are:

1. Which concrete approach is more suitable for us: MPI, a primitive IB/RDMA library, or ULPs over RDMA?

As far as I know, an existing Message Passing Interface (MPI) library, primitive IB/RDMA libraries such as libibverbs and librdmacm, and User-Level Protocols (ULPs) over RDMA might all be feasible choices, but I am not sure of their applicable scopes.
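To make the access pattern concrete, here is a minimal, hypothetical sketch of the "master writes into slave memory" step using one of the candidates, MPI one-sided communication. The 16 KB region size, window layout, and synchronisation scheme are placeholders rather than a finished design; with an InfiniBand-aware MPI implementation a call like `MPI_Put` can be carried out as an RDMA write, so the slaves stay passive:

```c
/* Hypothetical sketch: master (rank 0) writes a buffer into each slave's
 * memory with MPI one-sided communication.  Region size and window layout
 * are made up for illustration. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t region = 16 * 1024;              /* 16 KB per slave, arbitrary */
    char *mem = malloc(region);

    /* Every rank exposes its buffer; only the slaves' windows are written to. */
    MPI_Win win;
    MPI_Win_create(mem, (MPI_Aint)region, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        char *payload = malloc(region);
        memset(payload, 0xAB, region);

        /* Push the payload into each slave's window; the slaves never call
         * a matching receive. */
        for (int slave = 1; slave < nprocs; slave++) {
            MPI_Win_lock(MPI_LOCK_SHARED, slave, 0, win);
            MPI_Put(payload, (int)region, MPI_BYTE,
                    slave, 0, (int)region, MPI_BYTE, win);
            MPI_Win_unlock(slave, win);           /* completes the transfer */
        }
        free(payload);
    }

    MPI_Barrier(MPI_COMM_WORLD);                  /* slaves now hold the data in mem */

    MPI_Win_free(&win);
    free(mem);
    MPI_Finalize();
    return 0;
}
```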

2. Should I do any tuning of the OS or the IB network for better performance?

A paper from Microsoft [1] states that:

We improved performance by up to a factor of eight with careful tuning and changes to the operating system and the NIC driver.

For my part, I will avoid such performance tuning as much as I can; however, if tuning is unavoidable, I will do my best. Our IB network is Mellanox InfiniBand QDR 40 Gb/s, and I am free to choose the Linux distribution for the servers.

Any ideas, comments, and answers are welcome! Thanks in advance!

[1] FaRM: Fast Remote Memory

foool
  • I'm voting to close this question as primarily opinion-based. Each of the listed technologies can be used to achieve one or another of the specified objectives, although at a different price in terms of ease of use and code maintainability. And without intricate knowledge of the network profile of your application, I doubt anyone could answer the second question. – Hristo Iliev May 28 '15 at 11:04
  • @HristoIliev Thanks for your comment. I would like to clarify my questions: First, which is the easy method to achieve my application's requirements? Second, how can I avoid the tuning work by choosing a mature library which just fits my requirements (just like fast remote `memcpy`). – foool May 28 '15 at 13:47

1 Answer


If you use MPI, you will have the benefit of an interconnect-independent solution. It doesn't sound like this is going to be something you are going to keep around for 20 years, but software lasts longer than you ever think it will.

Using MPI also gives you the benefit of being able to debug on your (possibly oversubscribed) laptop or workstation before rolling it out onto the InfiniBand machines.
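For example, with Open MPI you can oversubscribe a four-core laptop with something like `mpirun --oversubscribe -np 8 ./your_app` (the binary name here is a placeholder) and exercise the same code paths you will later run over InfiniBand.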

As to your second question about tuning the network: I am sure there is no end of tuning you can do, but until you have some real workloads and hard numbers, you're wasting your time. Get things working first, then worry about optimizing the network. Maybe you need to tune for many tiny messages. Perhaps you need to worry about a few large transfers. The tuning will be pretty different depending on the case.
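If you want baseline numbers for the fabric itself before touching the application, the perftest utilities shipped with the Mellanox/OFED stack (e.g. `ib_write_lat` and `ib_write_bw`) and the OSU micro-benchmarks cover the small-message-latency and large-transfer-bandwidth cases respectively, assuming those packages are available on your chosen distribution.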

Rob Latham