
I run a C++ MPI program (using Boost.MPI) on a 12-node Windows HPC cluster (24 cores per node). I did one run with the MPI reduce and one with the MPI reduce commented out (for a speed test only). The run times are 01:17:23 and 01:03:49, so it seems to me that the MPI reduce takes a large portion of the time. I think it might be worth trying to reduce at the node level first, and then reduce to the head node, to improve performance.

Below is a simple example for testing purposes. Suppose there are 4 compute nodes, each with 2 cores. I want to first use MPI to reduce within each node, and after that, reduce across nodes to the head node. I am not very familiar with MPI, and the program below crashes.

#include <iostream>
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
using namespace std;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  int i = world.rank();


  boost::mpi::communicator local = world.split(world.rank()/2); // 8 ranks total, divided into 4 node-level groups
  boost::mpi::communicator heads = world.split(world.rank()%4); // intended to group the node-level roots together

  int res = 0;

  boost::mpi::reduce(local, i, res, std::plus<int>(), 0);
  if (world.rank() % 2 == 0)
      cout << res << endl;
  boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

  if(world.rank()==0)
      cout<<res<<endl;

  return 0;
}

The output is illegible, something like this

Z
h
h
h
h
a
a
a
a
n
n
n
n
g
g
g
g
\
\
\
\
b
b
b
b
o
o
o
o
o
o
o
o
s
...
...
...

The error message is

Test.exe ended prematurely and may have crashed. exit code 3

I suspect I did something wrong with the communicator split or the reduce, but I cannot figure it out after several attempts. How do I change this to make it work? Thanks.

user11594134
  • Why? For what reason? Also, please always provide a [mcve] and a clearer problem description. – Zulan Jul 01 '19 at 22:36
  • do you want to re-invent `MPI_Reduce()` or `MPI_Allreduce()`? Generally speaking, you should let the MPI library optimize that for you (a hierarchical algorithm is not always the most efficient one). In MPI, you would use different send and recv buffers or `MPI_IN_PLACE`; not sure that applies to Boost.MPI though. – Gilles Gouaillardet Jul 02 '19 at 00:01
  • @Zulan, as mentioned in the edited version, this is just an example. In my real program, I run it on a 12-node cluster (each node with 24 cores). The reduce phase is slow; that is the reason. The above code compiles but crashes if you have Boost.MPI installed; I need to figure out why. For plain MPI, I do not know how to write the code. – user11594134 Jul 02 '19 at 11:52
  • @GillesGouaillardet, that is a great point. MPI_Reduce might already do that optimization, so this might turn out to be even slower, but I think I still want to try it out to make sure. – user11594134 Jul 02 '19 at 11:53
  • “reduce is very slow” might mean there is some imbalance and the root task has to (implicitly) wait for the others. You should add barriers and timers to figure out whether the time is spent in the barriers (e.g. imbalance) or in the communication (e.g. a slow reduce); see the timing sketch after these comments. – Gilles Gouaillardet Jul 02 '19 at 12:25
  • @GillesGouaillardet, you are exactly right. There is certainly an imbalance in my program, and it is a big problem. I have not figured out a way to solve it and have had to put it on hold. The logic behind the two-stage reduce idea is this: the reduce part in the real program is not trivial (a complex class, lots of adds and divides), so it might save time for a node that has already finished all of its tasks to do its own node-level reduce. When the final reduce comes, the head node then only needs to reduce one item for that node instead of 24. As you have suggested, this might turn out even worse; I just want to make sure. – user11594134 Jul 02 '19 at 12:40
  • Trying to optimize a performance issue whose bottleneck you do not fully understand is generally a bad idea. I would suggest using a proper MPI-aware performance analysis tool **first**. Besides that, the code snippet is not a [mcve]: it does **not** compile, and you should always present your own debugging attempts. Also, try to explain the reasoning behind your code: which ranks participate in the two-stage reduction? – Zulan Jul 02 '19 at 12:54
  • @Zulan, per your suggestion, the question has been edited. I am not quite familiar with MPI performance analysis tools. Do you have any recommendations? Thanks. – user11594134 Jul 02 '19 at 13:51
  • I have no idea about Windows, but a search engine will give you fine directions towards MPI performance analysis tools. If you are working at an HPC site, ask the operator. Thanks for updating the question, but note that the code still does not compile (if you are still curious as to why it crashes). – Zulan Jul 02 '19 at 17:33
  • @Zulan, do you have MPI and Boost.MPI installed? If so, there should not be any problem. What is your compile error? – user11594134 Jul 02 '19 at 18:30
  • The headers and the main function are missing. Any compiler would have told you that, had you tried to compile the actual code from this question. – Zulan Jul 03 '19 at 17:13
  • @Zulan, the whole code is now posted. Let us see if you still get compile errors. – user11594134 Jul 03 '19 at 17:20
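
As a minimal sketch of the barrier-and-timer measurement suggested in the comments (illustrative code only, not from the discussion; it assumes Boost.MPI's mpi::timer, which wraps MPI_Wtime):

#include <iostream>
#include <functional>
#include <boost/mpi.hpp>
#include <boost/mpi/timer.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  int value = world.rank();
  int sum = 0;

  mpi::timer t;
  world.barrier();                   // wait for the slowest rank
  double barrier_time = t.elapsed(); // large values here indicate imbalance

  t.restart();
  mpi::reduce(world, value, sum, std::plus<int>(), 0);
  double reduce_time = t.elapsed();  // time spent in the reduce itself

  std::cout << "rank " << world.rank() << ": barrier " << barrier_time
            << " s, reduce " << reduce_time << " s" << std::endl;
  return 0;
}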

1 Answer


The reason for the crash is that you pass the same variable twice to MPI in the following line:

boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

This is not well documented in Boost.MPI, but Boost takes these arguments by reference and passes the respective pointers on to MPI. MPI generally forbids passing the same buffer twice in the same call. To be precise, an output buffer passed to an MPI function must not alias (overlap) any other buffer passed in that call.

You can easily fix this by creating a copy of res.
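
For example (a minimal sketch; `in_value` is just an illustrative name):

int in_value = res; // copy, so the input and output buffers no longer alias
boost::mpi::reduce(heads, in_value, res, std::plus<int>(), 0);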

I also think you probably want to restrict calling the second reduce to the processes with local.rank() == 0.
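
Putting this together, here is a minimal sketch of the two-stage reduction (with one extra assumption on my part: the heads communicator is built with world.rank() % 2 rather than % 4, so that all four node-level roots land in the same group):

#include <iostream>
#include <functional>
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  int i = world.rank();

  // One communicator per "node" (pairs of ranks in this toy setup).
  mpi::communicator local = world.split(world.rank() / 2);
  // Even world ranks (the local roots) in one group, odd ranks in another.
  mpi::communicator heads = world.split(world.rank() % 2);

  // Stage 1: reduce within each node; local rank 0 holds the node's sum.
  int local_sum = 0;
  mpi::reduce(local, i, local_sum, std::plus<int>(), 0);

  // Stage 2: only the local roots participate, and the input (local_sum)
  // and output (global_sum) buffers are distinct variables.
  if (local.rank() == 0) {
    int global_sum = 0;
    mpi::reduce(heads, local_sum, global_sum, std::plus<int>(), 0);
    if (world.rank() == 0)
      std::cout << global_sum << std::endl; // 0+1+...+7 = 28
  }

  return 0;
}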

Also, reiterating the comments: I doubt you will get any benefit from re-implementing the reduction. Trying to optimize a performance issue whose bottleneck you do not fully understand is generally a bad idea.

Zulan