Crummy performance with MPI

Question

I'm learning MPI and have a question about almost no performance gain in the simple implementation below.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        int mpirank, mpisize;
        int tabsize = atoi(*(argv + 1));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

        unsigned long int sum = 0;
        int rcvsize = tabsize / mpisize;
        int *rcvbuf = malloc(rcvsize * sizeof(int));
        int *tab = malloc(tabsize * sizeof(int));
        int totalsum = 0;

        if(mpirank == 0){
            for(int i=0; i < tabsize; i++){
               *(tab + i) = 1;
            }
        }
        MPI_Scatter(tab, tabsize/mpisize, MPI_INT, rcvbuf, tabsize/mpisize, MPI_INT, 0, MPI_COMM_WORLD);

        for(int i=0; i < tabsize/mpisize; i++){
                sum += *(rcvbuf + i);
        }

        MPI_Reduce(&sum, &totalsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if(mpirank == 0){
                printf("The totalsum = %li\n", totalsum);
        }

        MPI_Finalize();

        return 0;
}

Execution times of above implementation are:

$ /usr/bin/time mpirun -np 1 test1 2000000000 
The totalsum = 2000000000
13.76user 3.31system 0:17.30elapsed 98%CPU (0avgtext+0avgdata 15629824maxresident)k 0inputs+8outputs (0major+21720minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 1 test1 2000000000
The totalsum = 2000000000
13.78user 3.29system 0:17.31elapsed 98%CPU (0avgtext+0avgdata 15629824maxresident)k 0inputs+8outputs (0major+21717minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 1 test1 2000000000
The totalsum = 2000000000
13.78user 3.32system 0:17.33elapsed 98%CPU (0avgtext+0avgdata 15629828maxresident)k 0inputs+8outputs (0major+20697minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 20 test1 2000000000
The totalsum = 2000000000
218.42user 6.10system 0:12.99elapsed 1727%CPU (0avgtext+0avgdata 8209484maxresident)k 0inputs+17400outputs (118major+82587minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 20 test1 2000000000
The totalsum = 2000000000
216.17user 6.37system 0:12.89elapsed 1726%CPU (0avgtext+0avgdata 8209488maxresident)k 0inputs+17168outputs (126major+81092minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 20 test1 2000000000
The totalsum = 2000000000
216.16user 6.09system 0:12.88elapsed 1724%CPU (0avgtext+0avgdata 8209492maxresident)k 0inputs+17192outputs (111major+81665minor)pagefaults 0swaps

Which gives only about 25% performance gain. My guess here is that the bottleneck may be caused by processes that compete to access the memory. Then I tried same but without using memory to get to data.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        int mpirank, mpisize;
        int tabsize = atoi(*(argv + 1));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

        unsigned long int sum = 0;

        for(int i=0; i < tabsize/mpisize; i++){
                sum += 1;
        }

        MPI_Reduce(&sum, &totalsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if(mpirank == 0){
                printf("The totalsum = %li\n", totalsum);
        }

        MPI_Finalize();

        return 0;
}

which gave the following results:

$ /usr/bin/time mpirun -np 1 test2 2000000000  
The totalsum = 2000000000
6.17user 0.11system 0:06.49elapsed 96%CPU (0avgtext+0avgdata 5660maxresident)k 0inputs+8outputs (0major+4005minor)pagefaults 0swaps 
$ /usr/bin/time mpirun -np 1 test2 2000000000 
The totalsum = 2000000000
6.16user 0.12system 0:06.49elapsed 96%CPU (0avgtext+0avgdata 5660maxresident)k 0inputs+8outputs (0major+4007minor)pagefaults 0swaps 
$ /usr/bin/time mpirun -np 1 test2 2000000000 
The totalsum = 2000000000
6.15user 0.11system 0:06.47elapsed 96%CPU (0avgtext+0avgdata 5664maxresident)k 0inputs+8outputs (0major+4005minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 20 test2 2000000000
The totalsum = 2000000000
8.67user 2.41system 0:01.06elapsed 1040%CPU (0avgtext+0avgdata 6020maxresident)k 0inputs+16824outputs (128major+49952minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 20 test2 2000000000
The totalsum = 2000000000
8.59user 2.74system 0:01.05elapsed 1076%CPU (0avgtext+0avgdata 6028maxresident)k 0inputs+16792outputs (131major+49960minor)pagefaults 0swaps
$ /usr/bin/time mpirun -np 20 test2 2000000000
The totalsum = 2000000000
8.65user 2.61system 0:01.06elapsed 1058%CPU (0avgtext+0avgdata 6024maxresident)k 0inputs+16792outputs (116major+50002minor)pagefaults 0swaps

This shows about 83% performance gain and would confirm my guesses. Then could you tell me if my guesses are correct and if yes are there any ways to improve the first implementation with memory access?

The code has been run on machine with 20 physical cores.

EDIT1: additional results of the first implementation for 2, 5 and 10 processes:

$ /usr/bin/time mpirun -np 2 test1 2000000000
The totalsum = 2000000000
24.05user 3.40system 0:14.03elapsed 195%CPU (0avgtext+0avgdata 11724552maxresident)k 0inputs+960outputs (6major+23195minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 5 test1 2000000000
The totalsum = 2000000000
55.27user 3.54system 0:12.88elapsed 456%CPU (0avgtext+0avgdata 9381132maxresident)k 0inputs+4512outputs (26major+31614minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 10 test1 2000000000
The totalsum = 2000000000
106.43user 4.07system 0:12.44elapsed 887%CPU (0avgtext+0avgdata 8599952maxresident)k 0inputs+8720outputs (51major+50059minor)pagefaults 0swaps

EDIT2:

I have put MPI_Wtime() to measure MPI_Scatter part of the first implementation as follows:

...
                for(int i=0; i < tabsize; i++){
                        *(tab + i) = 1;
                }
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        MPI_Scatter(tab, tabsize/mpisize, MPI_INT, rcvbuf, tabsize/mpisize, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);
        double end = MPI_Wtime();

        for(int i=0; i < tabsize/mpisize; i++){
                sum += *(rcvbuf + i);
...

and got the following results:

$ /usr/bin/time mpirun -np 1 test1 400000000
The MPI_Scatter time = 0.576 (14% of total)
3.13user 0.74system 0:04.08elapsed 95%CPU 
$ /usr/bin/time mpirun -np 2 test1 400000000
The MPI_Scatter time = 0.580 (18% of total)
5.19user 0.79system 0:03.25elapsed 183%CPU 
$ /usr/bin/time mpirun -np 4 test1 400000000
The MPI_Scatter time = 0.693 (22.5% of total)
9.99user 1.05system 0:03.07elapsed 360%CPU
$ /usr/bin/time mpirun -np 5 test1 400000000
The MPI_Scatter time = 0.669 (22.3% of total)
12.41user 1.01system 0:03.00elapsed 446%CPU 
$ /usr/bin/time mpirun -np 8 test1 400000000
The MPI_Scatter time = 0.696 (23.7% of total)
19.67user 1.25system 0:02.95elapsed 709%CPU 
$ /usr/bin/time mpirun -np 10 test1 400000000
The MPI_Scatter time = 0.701 (24% of total)
24.21user 1.45system 0:02.92elapsed 876%CPU

$ /usr/bin/time mpirun -np 1 test1 1000000000
The MPI_Scatter time = 1.434 (15% of total)
7.64user 1.71system 0:09.57elapsed 97%CPU
$ /usr/bin/time mpirun -np 2 test1 1000000000
The MPI_Scatter time = 1.441 (19% of total)
12.72user 1.75system 0:07.52elapsed 192%CPU 
$ /usr/bin/time mpirun -np 4 test1 1000000000
The MPI_Scatter time = 1.710 (25% of total)
24.16user 1.93system 0:06.84elapsed 381%CPU
$ /usr/bin/time mpirun -np 5 test1 1000000000
The MPI_Scatter time = 1.675 (25% of total)
30.29user 2.10system 0:06.81elapsed 475%CPU 
$ /usr/bin/time mpirun -np 10 test1 1000000000
The MPI_Scatter time = 1.753 (26.6% of total)
59.89user 2.47system 0:06.60elapsed 943%CPU

$ /usr/bin/time mpirun -np 10 test1 100000000
The MPI_Scatter time = 0.182 (15.8% of total)
6.75user 1.07system 0:01.15elapsed 679%CPU 
$ /usr/bin/time mpirun -np 10 test1 200000000
The MPI_Scatter time = 0.354 (20% of total)
12.50user 1.12system 0:01.71elapsed 796%CPU 
$ /usr/bin/time mpirun -np 10 test1 300000000
The MPI_Scatter time = 0.533 (22.8% of total)
18.54user 1.30system 0:02.33elapsed 849%CPU
$ /usr/bin/time mpirun -np 10 test1 400000000
The MPI_Scatter time = 0.702 (23.95% of total)
24.38user 1.37system 0:02.93elapsed 879%CPU 
$ /usr/bin/time mpirun -np 10 test1 1000000000
The MPI_Scatter time = 1.762 (26% of total)
60.17user 2.42system 0:06.62elapsed 944%CPU

dreamcrash · Accepted Answer · 2021-01-03T16:20:06.270

Which gives only about 25% performance gain. My guess here is that the bottleneck may be caused by processes that compete to access the memory. (..)

Your code is mostly communication- and CPU- bound. Moreover, according to your results for 2, 5, and 10 processes:

 $ /usr/bin/time mpirun -np 2 test1 2000000000
The totalsum = 2000000000
24.05user 3.40system 0:14.03elapsed 195%CPU (0avgtext+0avgdata 11724552maxresident)k 0inputs+960outputs (6major+23195minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 5 test1 2000000000
The totalsum = 2000000000
55.27user 3.54system 0:12.88elapsed 456%CPU (0avgtext+0avgdata 9381132maxresident)k 0inputs+4512outputs (26major+31614minor)pagefaults 0swaps

$ /usr/bin/time mpirun -np 10 test1 2000000000
The totalsum = 2000000000
106.43user 4.07system 0:12.44elapsed 887%CPU (0avgtext+0avgdata 8599952maxresident)k 0inputs+8720outputs (51major+50059minor)pagefaults 0swaps

The code stops scaling already at around five processes, unlikely (at this point) for the memory bound-width to be saturated.

Then I tried same but without using memory to get to data. (..) This shows about 83% performance gain and would confirm my guesses.

But you also removed the MPI_Scatter call. Consequently, reducing the communication overhead, while keeping basically the same amount of work to be performed in parallel.

I have profiled your code in my machine (2 physical cores; 4 logical). To measures the times, I am using MPI_Wtime(); as follows:

int main(int argc, char **argv)
{
        int mpirank, mpisize;
        int tabsize = atoi(*(argv + 1));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        ...
                if(mpirank == 0){
                printf("The totalsum = %li\n", totalsum);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        double end = MPI_Wtime();
        if(mpirank == 0)
          printf("Time:%f\n",end-start);
}

for an input equal to yours (i.e., 2000000000) the results were:

1 process : 25.158740 seconds
2 processes : 19.116490 seconds
4 processes : 15.971734 seconds

An improvement of around 40% and the memory-hierarchy of my machine should be much inferior to a machine with 20 physical cores.

Let us now significantly reduce the input size, hence reducing the memory footprint, from 2000000000 (8 gigabytes) to just 250000000 (1 gigabytes), and retest again:

1 process : 1.312354 seconds
2 processes : 1.229174 seconds
4 processes : 1.232522 seconds

An improvement of around 6%; If the bottleneck was the processes competing for memory, I would not expect such a reduction in speedups after reducing the memory footprint. Nonetheless, this reduction can be easily explained by the fact that by reducing the input size I increased the ratio of communication per computation.

Let us go back to the tests with 2000000000 elements but this time measuring the time spent on the MPI_Scatter communication routine (the one that you have removed):

2 processes : 7.487354 seconds
4 processes : 8.728969 seconds

As one can see with 2 and 4 processes, approximately 40% (i.e., 7.487354/ 19.116490) and 54% (i.e., 8.728969/ 15.971734) of the application execution time was spent on the MPI_Scatter alone, respectively. That is why, when you removed that routine, you have noticed an improvement in speedup.

Now the same test for the input 250000000 (1 gigabytes):

2 processes ::0.679913 seconds (55% of the time)
4 processes : 0.691987 seconds (56% of the time)

As you can see, even with a smaller memory footprint, the overhead of the MPI_scatter remained percentage-wise around the same (for 4 processes). The conclusion is that the more processes, the less computation per process, and consequently, the higher is the ratio of communication per computation -- excluding other overheads that might popup with a higher number of processes running. Moreover, in your code, with more processes the memory usage does not grow linear, except for the main process (that contains the entire data) the reaming processes will have the data scatter among them.

Typically, a good MPI_scatter implementation, will have a time complexity of O(n log p), with n being the size of the input and p the number of processes. Therefore, the overhead of the MPI_scatter will increase faster by increasing the input size then by increasing the number of processes involved on that communication. However, by increasing the input size you have more computation per process being performed in parallel, whereas if you increase the number of processes you will have less computation per process being performed.

Bear in mind, however, that the tests that I have performed are not the most accurate ever, because of the environment that I am running, my MPI implementation might differ from yours, and so on. Nonetheless, I am confident that if you perform the same tests on your setup, you will draw the same conclusions.

Crummy performance with MPI

1 Answers1