I have a multi-process program written in Visual C++ with Boost MPI. Each process does its part, and at the end process 0 gathers all the results and summarizes them. Below is an excerpt of the code (poolsummary is a class that uses Boost serialization):
if(rank == 0){
    vector<poolsummary> ps_;
    vector<poolsummary> ps2_;
    gather(world, ps, ps_, 0);
    gather(world, ps2, ps2_, 0);
    for(int i = 1; i < size; i++){
        ps_[0].updateFromPool(ps_[i]);
        ps2_[0].updateFromPool(ps2_[i]);
    }
    ps_[0].Save_file(asp.SCENARIO_PATH);
    ps2_[0].Save_file2(asp.SCENARIO_PATH);
    vector<poolsummary>().swap(ps_);
    vector<poolsummary>().swap(ps2_);
}else{
    gather(world, ps, 0);
    gather(world, ps2, 0);
}
The program still needs to gather two additional classes (let us call them hist and rep).
Usually I run this program on 64 processors, and there is a long tail in this gather part. I can think of two ways that might improve the performance:
1. Use a non-blocking gather or something similar.
2. Group the processes into 8 groups (e.g. processes 0-7 as group 1, processes 8-15 as group 2, ...), first do a gather within each group, and then gather across the groups.
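For option 2, this is roughly the rank arithmetic I have in mind, written as a plain C++ simulation without MPI so the grouping can be checked on its own. Summary is a hypothetical stand-in for poolsummary, and groupOf, leaderOf and twoLevelMerge are names I made up for illustration. I assume the group index could serve as the color passed to boost::mpi::communicator::split to build the per-group communicators, but I have not tried that yet.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for poolsummary: it only carries a running total.
struct Summary {
    long total = 0;
    void updateFromPool(const Summary& other) { total += other.total; }
};

// Group index for a rank when groupSize consecutive ranks form one group.
int groupOf(int rank, int groupSize) {
    assert(rank >= 0 && groupSize > 0);
    return rank / groupSize;
}

// The lowest rank of each group acts as that group's leader.
int leaderOf(int rank, int groupSize) {
    return groupOf(rank, groupSize) * groupSize;
}

// Simulates the two-level merge; perRank[r] is the value held by rank r.
Summary twoLevelMerge(const std::vector<Summary>& perRank, int groupSize) {
    assert(!perRank.empty() && groupSize > 0);
    const int size = static_cast<int>(perRank.size());
    std::vector<Summary> merged(perRank);  // working copy, one slot per rank

    // Stage 1: every non-leader folds its value into its group leader.
    for (int r = 0; r < size; ++r) {
        const int leader = leaderOf(r, groupSize);
        if (r != leader) merged[leader].updateFromPool(merged[r]);
    }

    // Stage 2: every leader other than rank 0 folds into rank 0.
    for (int leader = groupSize; leader < size; leader += groupSize) {
        merged[0].updateFromPool(merged[leader]);
    }
    return merged[0];
}
```

With 64 ranks and a group size of 8, ranks 8-15 get group index 1 and leader 8, and the final result still ends up on rank 0 as in the current code.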
Could someone tell me whether these solutions will work? If not, what are some other possible ways to improve the performance? If so, how do I implement these two? Thanks so much for your time.