0

EDITED IN LIGHT OF THE COMMENTS

I am learning MPI and I am doing some exercises to understand some aspects of it. I have written a code that should perform a simple Monte-Carlo.

There are two main loops in it that have to be accomplished: one on the time steps T and a smaller one inside this one on the number of molecules N. So after I attempt to move every molecule the program goes to the next time step.

I tried to parallelize it by dividing the operations on the molecules on the different processors. Unfortunately the code, which works for 1 processor, prints the wrong results for total_E when p>1. The problem probably lies in the following function and more precisely is given by a call to MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);

I completely don't understand why. What am I doing wrong? (besides a primitive parallelization strategy)

My logic was that for every time step I could calculate the moves on the molecules on the different processors. Unfortunately, while I work with the local vectors local_r on the various processors, to calculate the energy difference local_DE, I need the global vector r since the energy of the i-th molecule depends on all the others. Therefore I thought to call MPI_Allgather since I have to update the global vector as well as the local ones.

/*
 * One Monte-Carlo sweep over this rank's n molecules (Metropolis scheme).
 *
 * H       pair-potential callback H(a,b)
 * local_r this rank's n positions (updated in place on acceptance)
 * r       global N-position array; refreshed via MPI_Allgather
 * E_      running local energy, incremented by accepted moves
 * n       molecules owned by this rank
 * my_rank rank id, used to map local index i -> global index in E_single
 */
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){

    int i;
    /* NOTE(review): local_rt is never free()d -- leaks n doubles per call. */
    double* local_rt = calloc(n,sizeof(double));
    double local_DE;

    for(i=0;i<n;i++){

        /* Trial move: uniform displacement in roughly [-delta/2, delta/2).
           NOTE(review): lrand48() returns a long in [0, 2^31); RAND_MAX is
           the bound for rand(), not lrand48() -- drand48() would be the
           matching uniform-[0,1) call. */
        local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
            local_rt[i] = periodic(local_rt[i]);

            /* Energy change of moving molecule i against the (stale until the
               gather below) global configuration r. */
            local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);

        /* Metropolis acceptance: always accept downhill, uphill with
           probability exp(-DE). Same lrand48()/RAND_MAX caveat as above. */
        if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX  ) {               
            (*E_) += local_DE;
            local_r[i] = local_rt[i];

        }
/* Collective: every rank executes this once per iteration (all ranks share
   the same n), so the call counts match across ranks. Each gather merges p
   concurrently-evolved partial configurations, which is why results differ
   from the serial sweep -- see the accepted answer. */
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);

    }


    return ;

}

Here is the complete "working" code:

#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>

#define N 100
#define L 5.0
#define T_ 5000
#define delta 2.0


void Step(double (*)(double,double),double*,double*,double*,int,int);
double H(double ,double );
double E(double (*)(double,double),double* ,double*,int ,int );
double E_single(double (*)(double,double),double* ,double*,int ,int ,int);
double * pos_ini(void);
double periodic(double );
double dist(double , double );
double sign(double );

/*
 * Driver for the parallel Metropolis Monte-Carlo run.
 *
 * Usage: ./program <outfile>
 * Rank 0 writes one line per time step: "<t>\t<total_E/N>\n".
 *
 * Fixes relative to the original:
 *  - the initial calloc of r was immediately leaked when pos_ini()
 *    replaced the pointer; r now starts NULL and is owned by pos_ini's
 *    allocation,
 *  - fopen() result is checked before use,
 *  - outfile is fclose()d (flushes buffered output) and the arrays freed
 *    before exit.
 */
int main(int argc,char** argv){

    if (argc < 2) {
        printf("./program <outfile>\n");
        exit(-1);
    }

    /* Same seed on every rank: pos_ini() then produces an identical global
       configuration everywhere; Step() afterwards draws rank-local
       sub-sequences. */
    srand48(0);

    int my_rank;
    int p;
    FILE* outfile = fopen(argv[1],"w");
    if (outfile == NULL) {
        printf("cannot open %s\n",argv[1]);
        exit(-1);
    }

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
    MPI_Comm_size(MPI_COMM_WORLD,&p);

    double total_E,E_ = 0.0;
    int n = N/p;      /* molecules per rank; assumes p divides N evenly */
    int t;

    /* r: global configuration, allocated by pos_ini() at t == 0.
       local_r: this rank's slice of n molecules. */
    double *r = NULL, *local_r = calloc(n,sizeof(double));

    for(t = 0;t<=T_;t++){
        if(t ==0){
            r = pos_ini();
            MPI_Scatter(r,n,MPI_DOUBLE, local_r,n,MPI_DOUBLE, 0, MPI_COMM_WORLD);
            E_ = E(H,local_r,r,n,my_rank);
        }else{
            Step(H,local_r,r,&E_,n,my_rank);
        }

        /* Sum the per-rank energies; every rank gets the total. */
        total_E = 0;
        MPI_Allreduce(&E_,&total_E,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

        if(my_rank == 0){
            fprintf(outfile,"%d\t%lf\n",t,total_E/N);
        }

    }

    free(local_r);
    free(r);
    fclose(outfile);

    MPI_Finalize();

    return 0;

}



/* Sign of a: -1.0 for negative values, +1.0 otherwise (including 0). */
double sign(double a){
    return (a < 0) ? -1.0 : 1.0;
}

/*
 * Minimum-image wrap: map a displacement into the primary box [-L/2, L/2].
 *
 * fabs(a) replaces the original sqrt(a*a): mathematically identical for
 * finite a, cheaper, and it cannot overflow the way a*a can for huge |a|.
 *
 * NOTE(review): only one box length is subtracted, so the result is inside
 * the box only when |a| < 3L/2 -- which holds here because delta < L.
 */
double periodic(double a){

    if(fabs(a) > L/2.0){
        a = a - sign(a)*L;
    }

    return a;
}


/*
 * Minimum-image distance between positions a and b (always >= 0).
 * fabs(d) replaces sqrt(d*d): same value, no overflow risk from d*d.
 */
double dist(double a, double b){

    double d = periodic(a-b);

    return fabs(d);
}

/*
 * Allocate and return the initial configuration: N positions drawn
 * uniformly from [-L/2, L/2). The caller owns the buffer and must free() it.
 *
 * Fix: the original computed (double)lrand48()/RAND_MAX. RAND_MAX is the
 * bound for rand(), not for lrand48() (whose range is [0, 2^31) per POSIX);
 * drand48() is the matching call that yields a uniform double in [0,1)
 * from the same 48-bit generator state seeded by srand48().
 */
double * pos_ini(void){

  double * r  = calloc(N,sizeof(double));
  int i;

  if (r == NULL) {
    return NULL;   /* propagate allocation failure to the caller */
  }

  for(i = 0;i<N;i++){
    r[i] = drand48()*L - L/2.0;
  }

  return r;

}

/*
 * Pair potential: exp(-d^2)/d for separation d < 2.0, else 0, where d is
 * the minimum-image distance between a and b.
 *
 * The original evaluated dist(a,b) three times per call; it is hoisted into
 * a local since this is the innermost operation of the whole simulation.
 *
 * NOTE(review): the potential diverges as d -> 0; nothing guards against
 * two particles at (numerically) the same position.
 */
double H(double a,double b){

    double d = dist(a,b);

    if(d < 2.0){
        return exp(-d*d)/d;
    }

    return 0.0;
}



/*
 * Energy contribution of this rank's n molecules against the global
 * configuration r. The factor 0.5 compensates for each pair being visited
 * from both of its endpoints once all ranks' contributions are summed.
 */
double E(double (*H)(double,double),double* local_r,double*r,int n,int my_rank){

    double acc = 0.0;
    int k;

    for(k = 0; k < n; k++)
        acc += E_single(H,local_r,r,k,n,my_rank);

    return 0.5 * acc;
}

/*
 * Energy of local molecule i interacting with every other molecule in the
 * global array r. The self-term is excluded by translating the local index
 * to its global position (i + n*my_rank).
 */
double E_single(double (*H)(double,double),double* local_r,double*r,int i,int n,int my_rank){

    double acc = 0.0;
    int global_i = i + n*my_rank;   /* global index of local molecule i */
    int j;

    for(j = 0; j < N; j++){
        if(j == global_i)
            continue;               /* no self-interaction */
        acc += H(local_r[i],r[j]);
    }

    return acc;
}


/*
 * One Monte-Carlo sweep over this rank's n molecules (Metropolis scheme).
 * For each molecule: propose a displacement, wrap it into the box, compute
 * the energy change against the global configuration r, accept/reject, and
 * re-gather the global configuration.
 */
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){

    int i;
    /* NOTE(review): local_rt is never free()d -- leaks n doubles per call. */
    double* local_rt = calloc(n,sizeof(double));
    double local_DE;

    for(i=0;i<n;i++){

        /* Trial move in roughly [-delta/2, delta/2).
           NOTE(review): RAND_MAX bounds rand(), not lrand48(); drand48()
           would be the matching uniform-[0,1) call. */
        local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
            local_rt[i] = periodic(local_rt[i]);

            local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);

        /* Metropolis: accept downhill always, uphill with prob. exp(-DE). */
        if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX  ) {               
            (*E_) += local_DE;
            local_r[i] = local_rt[i];

        }
/* Collective inside the loop: call counts match across ranks (same n on
   every rank), but each gather merges p concurrently-evolved partial
   configurations -- the source of the p-dependent results discussed. */
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);

    }


    return ;

}
Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186
Fra
  • 161
  • 8
  • 1
    A conditional `MPI_Allgather`? Wouldn't that block your program in case it is not invoked in all the processes? – Frederick Zhang May 11 '17 at 13:35
  • You are right, taking it out of the if removes the block; unfortunately the program still doesn't work – Fra May 11 '17 at 13:41
  • So what's the new issue? – Frederick Zhang May 11 '17 at 13:43
  • when p changes the value of total_E changes (It becomes clear if ones changes the seed of the random number generator to a fixed number), when p = 2 it is still similar to p = 1 but when p =4 it is radically different – Fra May 11 '17 at 13:46
  • the value at t = 0 is always correct on the other hand, so the problem is still in Step – Fra May 11 '17 at 13:47
  • 1
    I didn't really get your chemical or physical stuff but if you would like to perform gathering over an unpredictable part of all the processes, you could simply let the rest of the processes send some dummy data. A little wasteful but straightforward. – Frederick Zhang May 11 '17 at 13:53
  • I am no expert, I'm not sure I understood, could you please expand in an answer? – Fra May 11 '17 at 13:56
  • @Fra *when p =4 it is radically different*: What numbers do you get for `total_E` for different `p`? Also, please edit the code to change the seed of the random number generator to a fixed number to verify the results. – Shibli May 11 '17 at 14:16
  • I've done the edit. For instance, total_E at t = 1 : p = 1 total_E = 85.489044, p = 2 total_E = 84.927762, p = 4 total_E = 72.490657 – Fra May 11 '17 at 14:19
  • You realise that because each rank generates its own pseudo-random sequence, you cannot expect the configurations and therefore the energies after one step to be the same, right? Run the simulation for many steps and compare the average values. – Hristo Iliev May 11 '17 at 15:06

3 Answers3

2

You cannot expect to get the same energy given different number of MPI processes for one simple reason - the configurations generated are very different depending on how many processes are there. The reason is not MPI_Allgather, but the way the Monte-Carlo sweeps are performed.

Given one process, you attempt to move atom 1, then atom 2, then atom 3, and so on, until you reach atom N. Each attempt sees the configuration resulting from the previous one, which is fine.

Given two processes, you attempt to move atom 1 while at the same time attempting to move atom N/2. Neither atom 1 sees the eventual displacement of atom N/2 nor the other way round, but then atoms 2 and N/2+1 see the displacement of both atom 1 and atom N/2. You end up with two partial configurations that you simply merge with the all-gather. This is not equivalent to the previous case when a single process does all the MC attempts. The same applies for the case of more than two processes.

There is another source of difference - the pseudo-random number (PRN) sequence. The sequence produced by the repeated calls to lrand48() in one process is not the same as the combined sequence produced by multiple independent calls to lrand48() in different processes, therefore even if you sequentialise the trials, still the acceptance will differ due to the locally different PRN sequences.

Forget about the specific values of the energy produced after each step. In a proper MC simulation those are insignificant. What counts is the average value over a large number of steps. Those should be the same (within a certain margin proportional to 1/sqrt(N)) no matter the update algorithm used.

Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186
  • Does this mean that this code is inherently serial? After all, atoms should be treated sequentially. – Shibli May 11 '17 at 17:09
  • 1
    MCMC is inherently serial, but there are methods that allow one to run multiple Markov chains and periodically exchange information between them with the end result being very similar to what a sequential MCMC gives. See [here](https://arxiv.org/pdf/1312.7479v1.pdf) for a detailed description of one technique or [here](https://www2.stat.duke.edu/~scs/Parallel.pdf) for the simplified version. – Hristo Iliev May 11 '17 at 19:23
0

It's been quite a long time since I last used MPI, but it seems that your program halts when you try to "gather" and update the data in several of the processes, and it is unpredictable which processes would need to do the gathering.

So in this case a simple solution is to let the rest of the processes send some dummy data so they could simply be ignored by the others. For instance,

if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX  ) {               
    (*E_) += local_DE;
    local_r[i] = local_rt[i];
    MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
    // filter out the dummy data out of "r" here
} else {
    MPI_Allgather(dummy_sendbuf, n, MPI_DOUBLE, dummy_recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD);
}

Dummy data could be some exceptional wrong numbers which should not be in the results, so other processes could filter them out.

But as I mentioned, this is quite wasteful as you don't really need to receive that much data from all processes and we would like to avoid it especially when there're quite a lot of data to send.

In this case, you can gather some "flags" from other processes so that we could know which processes own data to send.

// pseudo codes
// for example, place 1 at local_flags[my_rank] if it's got data to send, otherwise 0
MPI_Allgather(local_flags, n, MPI_BYTE, recv_flags, n, MPI_BYTE, MPI_COMM_WORLD)
// so now all the processes know which processes will send
// receive data from those processes
MPI_Allgatherv(...)

I remember with MPI_Allgatherv, you could specify the number of elements to receive from a specific process. Here's an example: http://mpi.deino.net/mpi_functions/MPI_Allgatherv.html

But bear in mind this might be an overkill if the program is not well parallelized. For example, in your case, this is placed inside a loop, so those processes without data still need to wait for the next gathering of the flags.

Frederick Zhang
  • 3,593
  • 4
  • 32
  • 54
0

You should take MPI_Allgather() outside for loop. I tested with the following code but note that I modified the lines involving RAND_MAX in order to get consistent results. As a result, the code gives the same answer for number of processors 1, 2, and 4.

/*
 * Answer's variant of Step(): the MPI_Allgather is hoisted out of the
 * per-molecule loop, so each rank completes its whole sweep against a
 * stale global r and the configurations are merged once per time step.
 *
 * NOTE(review): the /RAND_MAX normalizations were removed here, so the
 * trial displacement uses the raw lrand48() integer (huge moves) and the
 * acceptance test compares exp(-DE) <= 1 against integers up to 2^31 --
 * this changes the sampled ensemble; it makes outputs match across p only
 * because acceptance behavior degenerates (see the comment below the
 * answer).
 */
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){

    int i;
    double* local_rt = calloc(n,sizeof(double));
    double local_DE;

    for(i=0;i<n;i++){

        //local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
        local_rt[i] = local_r[i] + delta*((double)lrand48()-0.5);
        local_rt[i] = periodic(local_rt[i]);

        local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);

        //if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX  )
        if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()  )
        {
            (*E_) += local_DE;
            local_r[i] = local_rt[i];
        }

    }

    /* Single collective per sweep: safe (no divergence across ranks) but
       atoms within a sweep no longer see each other's accepted moves. */
    MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);

    return ;
}
Shibli
  • 5,879
  • 13
  • 62
  • 126
  • This changes the physics and produces a completely different ensemble as atom `i+1` does not see the new position of atom `i`. – Hristo Iliev May 11 '17 at 16:17