
I'm trying to do matrix multiplication using MPI in C, and we have to write both a sequential version and a parallel version. My parallel version is not giving the correct answers, and I'm not sure why. I think I'm not sending the right data to the processes, but I can't be sure. The professor went over the different send/receive/gather calls but didn't go into much detail. I've seen a lot of different examples, but none complete and none using scatter/gather. If anyone can take a look at my code and tell me if anything pops out at them, I'd appreciate it. I'm pretty sure my problem is in the scatter/gather calls or in the actual calculation of the c matrix.

#define N 512

#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stddef.h>
#include "mpi.h"


print_results(char *prompt, float a[N][N]);

int main(int argc, char *argv[])
{
    int i, j, k, rank, size, tag = 99, blksz, sum = 0;
    float a[N][N], b[N][N], c[N][N];
    char *usage = "Usage: %s file\n";
    FILE *fd;
    double elapsed_time, start_time, end_time;
    struct timeval tv1, tv2;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc < 2) {
            fprintf (stderr, usage, argv[0]);
            return -1;
    }

    if ((fd = fopen (argv[1], "r")) == NULL) {
            fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                                    argv[0], argv[1]);
            fprintf (stderr, usage, argv[0]);
            return -1;
    }

    for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                    fscanf (fd, "%f", &a[i][j]);

    for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                    fscanf (fd, "%f", &b[i][j]);

    MPI_Barrier(MPI_COMM_WORLD);

    gettimeofday(&tv1, NULL);

    MPI_Scatter(a, N*N/size, MPI_INT, a, N*N/size, MPI_INT, 0,
                                                    MPI_COMM_WORLD);
    MPI_Bcast(b, N*N, MPI_INT, 0, MPI_COMM_WORLD);


    if (rank != 0) {
            for (i = 0; i < N; i++)
            {
                    for (j = 0; j < N; j++)
                    {
                            for (k = 0; k < N; k++)
                            {
                                    sum = sum + a[i][k] * b[k][j];
                            }
                            c[i][j] = sum;
                            sum = 0;
                    }
            }
    }

    MPI_Gather(c, N*N/size, MPI_INT, c, N*N/size, MPI_INT, 0,
                                                     MPI_COMM_WORLD);
    MPI_Finalize();

    gettimeofday(&tv2, NULL);

    elapsed_time = (tv2.tv_sec - tv1.tv_sec) + ((tv2.tv_usec - tv1.tv_usec)/1000000.0);

    printf ("elapsed_time=\t%lf (seconds)\n", elapsed_time);

    print_results("C = ", c);
}

print_results(char *prompt, float a[N][N])
{
    int i, j;

    printf ("\n\n%s\n", prompt);
    for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                    printf(" %.2f", a[i][j]);
            }
            printf ("\n");
    }
    printf ("\n\n");
}

Updated part of the code:

for (i=0;i<size; i++)
            {
                    if (rank == i)
                    {
                            for (i = rank*(N/size); i < (rank*(N/size)+(N/size)); i++)
                            {
                                    for (j = rank*(N/size); j < (rank*(N/size)+(N/size)); j++)
                                    {
                                            for (k = rank*N; k < rank*N+N; k++)
                                            {
                                                    sum = sum + a[i][k] * b[k][j];
                                            }
                                            c[i][j] = sum;
                                            sum = 0;
                                    }
                            }
                    }
            }
user939287
  • One thing that jumps out: the root should supply the magic value `MPI_IN_PLACE` for `recvbuf` in scatter and gather. Just supplying the same pointer for `sendbuf` and `recvbuf` technically violates the spec. Also, `MPI_Finalize` can take a long time. You should probably not be timing it. – Greg Inozemtsev Sep 17 '12 at 05:47
  • Sorry, `MPI_IN_PLACE` should be supplied for `sendbuf` in `MPI_Gather`. Also, the root participates in scatter and gather, but then you exclude it from doing the computation. It should participate, just like the other ranks. – Greg Inozemtsev Sep 17 '12 at 05:55
  • Besides what Greg Inozemtsev and Francesco have already spotted, your computational kernel loops over the entire matrix `a` and not only over the part that resides in the memory of the current rank. The range of the `i` loop should be restricted accordingly. Also MPI provides the `MPI_Wtime()` timer function which has the same precision as `gettimeofday` if not even higher (as the MPI standard advises the implementors to use the highest resolution timer available). – Hristo Iliev Sep 17 '12 at 11:08
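A rough sketch of what those comments describe, keeping the question's variable names: the root passes MPI_IN_PLACE, timing is done with MPI_Wtime(), and MPI_Finalize() is left outside the timed region. Note also that since the arrays are float, MPI_FLOAT is the matching MPI datatype (the question passes MPI_INT).

/* Sketch only: root uses MPI_IN_PLACE, timing uses MPI_Wtime() and
 * stops before MPI_Finalize(). */
double t0 = MPI_Wtime();

if (rank == 0)
    MPI_Scatter(a, N*N/size, MPI_FLOAT, MPI_IN_PLACE, N*N/size, MPI_FLOAT,
                0, MPI_COMM_WORLD);
else
    MPI_Scatter(NULL, N*N/size, MPI_FLOAT, a, N*N/size, MPI_FLOAT,
                0, MPI_COMM_WORLD);

MPI_Bcast(b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);

/* ... every rank computes its block here ... */

if (rank == 0)
    MPI_Gather(MPI_IN_PLACE, N*N/size, MPI_FLOAT, c, N*N/size, MPI_FLOAT,
               0, MPI_COMM_WORLD);
else
    MPI_Gather(c, N*N/size, MPI_FLOAT, NULL, 0, MPI_FLOAT,
               0, MPI_COMM_WORLD);

double t1 = MPI_Wtime();
if (rank == 0)
    printf("elapsed_time=\t%lf (seconds)\n", t1 - t0);

MPI_Finalize();   /* not included in the timed region */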

1 Answer


A first problem in your code is that size might not divide N, which means that scattering size packets of length N*N/size does not necessarily send the whole matrix. This is probably the hardest point to get right.
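
If size cannot be guaranteed to divide N, one option is MPI_Scatterv with per-rank counts and displacements. A rough sketch that distributes whole rows; sendcounts, displs, my_rows and a_local are illustrative names that do not appear in the question:

/* Sketch: split N rows as evenly as possible across `size` ranks. */
int *sendcounts = malloc(size * sizeof(int));
int *displs     = malloc(size * sizeof(int));
for (int r = 0, offset = 0; r < size; r++) {
    int rows = N / size + (r < N % size ? 1 : 0);   /* spread the remainder */
    sendcounts[r] = rows * N;                       /* counts are in elements */
    displs[r]     = offset;
    offset       += rows * N;
}

int my_rows = N / size + (rank < N % size ? 1 : 0);
MPI_Scatterv(a, sendcounts, displs, MPI_FLOAT,
             a_local, my_rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);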

As Greg Inozemtsev points out, a second problem is that you exclude process 0 from the computation, although it is responsible for a part of the matrix.

And yet another problem is that all I/O operations (reading the coefficients at the beginning and outputting the results at the end) should be done only by process 0.
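
A rough sketch of those last two points, keeping the question's variable names: rank 0 does all the reading and printing, while every rank, including rank 0, takes part in the computation.

if (rank == 0) {                      /* only the root touches the file */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf(fd, "%f", &a[i][j]);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf(fd, "%f", &b[i][j]);
    fclose(fd);
}

/* ... scatter a, broadcast b ... */

/* no `if (rank != 0)` guard here: rank 0 computes its block too */

/* ... gather c ... */

if (rank == 0)
    print_results("C = ", c);         /* output only on the root */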

On another note, you should specify the return type (void in this case) of your print_results function, both in the forward declaration and in the definition.
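
Concretely, that just means adding the return type in both places:

void print_results(char *prompt, float a[N][N]);   /* forward declaration */

void print_results(char *prompt, float a[N][N])    /* definition */
{
    /* ... body as in the question ... */
}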

François Févotte
  • I think the only thing now I'm confused about is how to make each processor only work on the parts of the matrix they are responsible for... – user939287 Sep 18 '12 at 15:32
  • Think about which loop in the original version you're parallelizing. You've distributed the matrix lines among all processes, which means you are parallelizing the loop over `i`. Now each process only needs to work on its `N/size` lines. (Again, that works only if `size` divides `N`.) – François Févotte Sep 19 '12 at 19:10
  • Well, N is defined as 512 and we're only using 1, 4, 8, 16, and 32 processors, so it will always divide evenly. I added some code after taking your suggestion (I put it in my original post, above) and I thought I had it figured out, but whenever I run it with more than one processor I get a segfault: "mpirun noticed that process rank 2 with PID 30215 on node compute-1-2.local exited on signal 11 (Segmentation fault)" – user939287 Sep 21 '12 at 20:12
  • You have to understand exactly what MPI_Scatter does: it takes the original, big vector on process 0, splits it, and sends one chunk to each process. Each receiving process only holds its chunk, indexed from 0, so all processes run the exact same `for` loop with `i` ranging from 0 to `N/size`. – François Févotte Sep 24 '12 at 06:24
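
A rough sketch of that, assuming size divides N; a_local and c_local are illustrative local buffers that do not appear in the question, and a float accumulator is used (the question declares sum as int, which would truncate the float products):

/* Sketch of the per-rank work: each rank receives N/size consecutive
 * rows of a (locally indexed from 0) plus the full b, and produces
 * N/size rows of c. */
int rows = N / size;                          /* assumes N % size == 0 */
float (*a_local)[N] = malloc(rows * sizeof *a_local);
float (*c_local)[N] = malloc(rows * sizeof *c_local);

MPI_Scatter(a, rows * N, MPI_FLOAT, a_local, rows * N, MPI_FLOAT,
            0, MPI_COMM_WORLD);
MPI_Bcast(b, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

for (i = 0; i < rows; i++) {                  /* local row index */
    for (j = 0; j < N; j++) {
        float s = 0.0f;                       /* float accumulator */
        for (k = 0; k < N; k++)
            s += a_local[i][k] * b[k][j];
        c_local[i][j] = s;
    }
}

MPI_Gather(c_local, rows * N, MPI_FLOAT, c, rows * N, MPI_FLOAT,
           0, MPI_COMM_WORLD);                /* full result lands in c on rank 0 */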