
I am learning MPI with C, and I wrote a program based on the one presented in this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml.

In this code a vector containing 1e8 values is summed. However, I am observing that the run time gets longer as I use more processes. The code is given below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which splits a vector and sends the pieces to the other processes.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process; therefore it does not count as a worker.

Each process calculates the partial sum of its piece of the vector and sends it back to the root process, which calculates the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){
        // Root process: Define vector size, how to split vector and send information to workers
        vec_size = 1e8; // size of main vector

        for(i = 0; i < vec_size; i++){
            //vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector...  Correct answer for total sum = 0.78539816339
            vec1[i] = pow(i,2)+1.0; // defining main vector... 
            //printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if youwhish to print the main vector
        }

        rows_per_proc = vec_size / (num_proc - 1); // integer division (already truncates): values per worker, using (num_proc - 1) because proc 0 does not count as a worker.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // splitting and sending the values
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
                start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
            }
            else{
                num_2_send = rows_per_proc;
                start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
            }
            
            ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the number of values going to this worker.
            ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
        }

        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);

    }
    else{
        // Workers: define which operation will be carried out by each one
        ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the number of values this worker must expect.
        ierr = MPI_Recv(vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving a piece of the main vector.
        
        partial_sum = 0;
        for(i=0; i < num_2_recv; i++){
            //printf("Position %d from worker id %d: %d\n", i, my_id, vec2[i]); // uncomment if youwhish to print position, id and value of splitted vector
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
        
    }

    ierr = MPI_Finalize();
    return 0;

}

Note: compile as


mpicc -o vector_sum vector_send.c -lm

and run as:

time mpirun -n x vector_sum 

with x = 2 and x = 5. You will see that with x = 5 it takes more time to run.

Did I do something wrong? I did not expect it to be slower, since the summation of each chunk is independent. Or is it a matter of how the program sends the information to each process? It seems to me that the loop sending the data to each process is responsible for the longer run time.
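For reference, one way to check where the time goes would be to bracket the send loop and the receive loop in the root branch with MPI_Wtime(); a minimal sketch of that measurement (the t0/t_send/t_recv variables are only for illustration and are not in the code above):

double t0, t_send, t_recv;

t0 = MPI_Wtime();
for(an_id = 1; an_id < num_proc; an_id++){
    // ... the same two MPI_Send calls as in the code above ...
}
t_send = MPI_Wtime() - t0; // time spent distributing the vector pieces

t0 = MPI_Wtime();
sum = 0;
for(an_id = 1; an_id < num_proc; an_id++){
    ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    sum = sum + partial_sum;
}
t_recv = MPI_Wtime() - t0; // time spent waiting for and adding the partial sums

printf("send loop: %f s, receive loop: %f s\n", t_send, t_recv);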

Felipe_SC
  • your program basically scatters the data, performs partial sums and reduces them. Though a partial sum is faster than a full sum, the scatter/reduce operations (communications) can be an important overhead that may increase the overall elapsed time. Instead of scattering `vec1` into `vec2`, you can directly initialize `vec2` on all the nodes and get rid of `vec1`. – Gilles Gouaillardet Sep 20 '22 at 22:49
  • Hi @GillesGouaillardet! Thank you very much for your answer. I am new to MPI programming, so let me check if I understood: do you mean defining the vector entries for different values of i on each node instead of defining them on the root process? – Felipe_SC Sep 21 '22 at 02:45
  • yes. instead of populating `vec1` and then scattering it (into `vec2`), you should get rid of `vec1` and directly have each node populate `vec2`. – Gilles Gouaillardet Sep 21 '22 at 04:11
  • I will try here and return with the results. Thank you! – Felipe_SC Sep 21 '22 at 11:24

2 Answers


As suggested by Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet), I modified the code to generate the vector pieces in each process instead of passing them from the root process. It worked! Now the elapsed time is smaller when more processes are used. I am posting the new code below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process; therefore it does not count as a worker.

Each process generates and calculates the partial sum of its piece of the vector and sends it back to the root process, which calculates the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if x = 3, the vector will be split in two.

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){

        vec_size = 1e8; // defining main vector size

        rows_per_proc = vec_size / (num_proc - 1); // integer division (already truncates): values per worker, using (num_proc - 1) because proc 0 does not count as a worker.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // defining the number of data and position corresponding to main vector
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_gen = rows_per_proc + leftover; // counting the amount of data to be generated.
                start_point = (an_id - 1)*num_2_gen; // defining corresponding initial position in the main vector.
            }
            else{
                num_2_gen = rows_per_proc;
                start_point = (an_id - 1)*num_2_gen + leftover; // defining corresponding initial position in the main vector for other processes if there is leftover.
            }

            ierr = MPI_Send(&num_2_gen, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the number of values this worker must generate.
            ierr = MPI_Send(&start_point, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the corresponding starting position in the main vector.
        }
        
        
        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);
        
    }
    else{
        ierr = MPI_Recv(&num_2_gen, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the number of values this worker must generate.
        ierr = MPI_Recv(&start_point, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the starting position.

        // generate and sum vector pieces
        partial_sum = 0;
        for(i = start_point; i < start_point + num_2_gen; i++){
            vec2[i] = pow(i,2)+1.0;
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
               
    }

    ierr = MPI_Finalize();
    return 0;
    
}

In this new version, instead of passing the main vector pieces, only the information on how to generate those pieces is sent to each process.
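A further possible simplification (also pointed out in the comments below): since every rank already knows my_id and num_proc, each worker could compute num_2_gen and start_point itself, so even the two small MPI_Send/MPI_Recv pairs above are not strictly needed. A minimal sketch of that worker-side computation, assuming vec_size is set on every rank rather than only inside the root branch:

// computed identically on every rank, so no metadata messages are needed
rows_per_proc = vec_size / (num_proc - 1);          // integer division, as in the root branch
leftover = vec_size - (num_proc - 1)*rows_per_proc; // remainder handled by worker 1
if(my_id == 1){
    num_2_gen = rows_per_proc + leftover;
    start_point = 0;
}
else{
    num_2_gen = rows_per_proc;
    start_point = (my_id - 1)*rows_per_proc + leftover;
}
// (rank 0 would simply not use these values in this scheme)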

Felipe_SC
  • sounds good! Note you do not even need to send `num_2_gen` nor `start_point` since they can be computed on all the ranks. Also, it is a common practice that rank `0` also has its own `vec2` so it computes its fair share instead of simply waiting for the other ranks to send their partial sums. Last but not least, this is a great opportunity to learn (and use) `MPI_Reduce()`. – Gilles Gouaillardet Sep 21 '22 at 12:34
  • These are very interesting suggestions! I have done some searching and found a code to compute the mean and standard deviation (https://github.com/mpitutorial/mpitutorial/blob/gh-pages/tutorials/mpi-reduce-and-allreduce/code/reduce_stddev.c) in which the author uses MPI_Allreduce() and MPI_Reduce(). It seems that in this case one does not have to send information from one process to another (am I right?). I will try to improve this code further and post any advances here. – Felipe_SC Sep 21 '22 at 13:22
  • Another question: is it possible to implement a parallel computing step as a function? For example, a code that has to perform the integration of a function in several steps: the quadratures could be implemented as a function and MPI used just in that part? I do not know if this would be the best way of implementing it. – Felipe_SC Sep 21 '22 at 13:28
  • An MPI application typically starts all the processes at the beginning. That would mean the non-root processes would sit idle most of the time. That is doable but highly suboptimal. – Gilles Gouaillardet Sep 21 '22 at 14:11
  • Right! One of my next steps will be to write a program which uses a sequence of integrations, so I will have the opportunity to work more on this idea. About the previous code: I managed to use MPI_Reduce() in it. The performance has improved and the code is simpler. I will post the new code here as an answer. – Felipe_SC Sep 21 '22 at 14:21

The new code using MPI_Reduce() is faster and simpler than the previous one:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 0.
Process id 0 is the root process; however, it also performs part of the calculations.

Each process generates and calculates the partial sum of its piece of the vector. MPI_Reduce() is used to calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum

x = number of processes. For example: if x = 3, the vector will be split in three (the root process also computes its share).

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process;
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    vec_size = 1e8; // defining the main vector size
    
    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    rows_per_proc = vec_size/num_proc; // integer division (already truncates): values for each process (every rank, including the root, computes a share).
    leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.

    if(my_id == 0){
        num_2_gen = rows_per_proc + leftover; // if there is a leftover, it is handled by process 0
        start_point = my_id*num_2_gen; // the corresponding position on the main vector
    }
    else{
        num_2_gen = rows_per_proc;
        start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
    }

    partial_sum = 0;
    for(i = start_point; i < start_point + num_2_gen; i++){
        vec2[i] = pow(i,2) + 1.0; // defining vector values
        partial_sum += vec2[i]; // calculating partial sum
    }

    printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum

    if(my_id == root_process){
        printf("Total sum is %f.\n", sum);
    }

    ierr = MPI_Finalize();
    return 0;
    
}
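As mentioned in the comments above, MPI_Allreduce() is the variant to use if every rank should end up with the total sum; it takes the same arguments as MPI_Reduce() except that there is no root argument. A minimal sketch of how that single call would replace the MPI_Reduce() line:

// every rank receives the same total in sum, so no separate broadcast is needed
MPI_Allreduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
printf("Total sum seen by process %d: %f.\n", my_id, sum);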
Felipe_SC