MPI application performance gets worse when adding more processes

Question

I wrote a parallel binary search algorithm with MPI. It works as expected in terms of finding a value, but when -n is 1 (serial) the total time is much lower than for any larger value such as 2, 4, 8, etc.

When I increase the number of processes the run takes longer, although I expected the time to be lower than with 1 process. What is the problem, and how can I solve it? Here is my code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

void BinarySearch(int local_x[], int search, int lower, int heigher, int rank, int comm_sz);
int* create_array(int n);

int index = -1;
int found_rank = -1;

int main() {
    int search = 7;
    
    MPI_Init(NULL, NULL);

    int my_rank, comm_sz;
    double start = 0;
    double finish = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int size =  10 * comm_sz;
    int* x = nullptr;

    int local_size = (size / comm_sz); //+ 1 
    int* local_x = new int[local_size];

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    if (my_rank == 0)
    {
        printf("size per process = %d \n", local_size);
        x = create_array(size);
        for (int i = 0; i < size; i++)
        {
            printf("%d ", x[i]);
        }

        //  printf("\n Please input number: \n");
        //getchar();
    }

    MPI_Scatter(x, local_size, MPI_INT, local_x, local_size, MPI_INT, 0, MPI_COMM_WORLD);

    BinarySearch(local_x, search, 0, local_size, my_rank, comm_sz);


    MPI_Barrier(MPI_COMM_WORLD);
    finish = MPI_Wtime();


    if (my_rank == 0) {
        printf("\n total time = %g \n", finish - start);

        if (found_rank == -1 || index == -1) {
            printf("Not found");
        }else
        printf("\n value %d located at index %d at rank %d \n", search, index,found_rank );

    }


    MPI_Finalize();
}

void BinarySearch(int local_x[], int search, int lower, int heigher, int rank, int comm_sz) {
    int mid = -1;
    int size = heigher;        // chunk length, used to compute the global index
    int correct_rank = -1;
    bool found = false;
    // Search the half-open range [lower, heigher).
    while (lower < heigher)
    {
        mid = lower + (heigher - lower) / 2;

        if (local_x[mid] > search) {
            heigher = mid;     // heigher is exclusive, so mid - 1 stays in range
        }
        else if (local_x[mid] < search) {
            lower = mid + 1;
        }
        else {
            found = true;
            break;
        }
    }
    if (found) {
        mid = (rank * size) + mid;   // convert the local index to a global index
        correct_rank = rank;
    }
    else {
        mid = -1;
    }

    // Ranks that did not find the value contribute -1, so MPI_MAX keeps the hit.
    MPI_Reduce(&mid, &index, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&correct_rank, &found_rank, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
}

int* create_array(int n) {
    int* tmp = (int*)calloc(n, sizeof(int));
    for (int i = 0; i < n; i++)
    {
        tmp[i] = i + 1 ;
    }
    return (tmp);
}

Your problem is extremely small so your runtime is completely dominated by process overhead. `int size = 10 * comm_sz;` Try a million instead of `10`. — Victor Eijkhout, May 28 '23 at 13:55
@VictorEijkhout Yeah, I thought about this and increased the problem size to the edge of my laptop's memory, but with no luck. Here are the results for a million: `mpiexec -n 1 Study1.exe size per process = 1000000 total time = 0.0004621 value 500000 located at index 499999 at rank 0` `mpiexec -n 2 Study1.exe size per process = 500000 total time = 0.0006074 value 500000 located at index 499999 at rank 0` `mpiexec -n 4 Study1.exe size per process = 250000 total time = 0.0010217 value 500000 located at index 499998 at rank 1` — baha, May 28 '23 at 18:11
Those times are still very low. Also, do you seriously have a bunch of `printf` in your timed loop? Get rid of all of that. Also remove the array creation out of the loop since that's probably not parallel. — Victor Eijkhout, May 29 '23 at 03:28
@VictorEijkhout Those `printf` calls are for debugging only; I comment them out and rebuild before running the actual test, and I moved the start time to just before the `MPI_Scatter`. Array creation is serial, correct, but it is not in the loop: it is created on rank 0 and then scattered to each process. I tried creating a 1 GB array of `int`; the time goes up to 1 sec on `-n 1` and ~2 sec on `-n 2`. — baha, May 29 '23 at 09:29
@VictorEijkhout My question now becomes: does this problem need something like a 16 GB+ array to see the benefits of parallelism? The time complexity of this algorithm is `O(log n)`, so I think that to see any benefit from parallelizing it, n must be a very large number that my laptop's memory can't handle. **Correct me if I'm wrong** — baha, May 29 '23 at 09:40
1. Your `create_array` call is in between the two timers. 2. No, you don't need a monstrous array. But the conclusion is that doing exactly one binary search is not worth parallelizing. 3. To eliminate overhead: Try doing 1000 searches: loop around your current code. — Victor Eijkhout, May 29 '23 at 17:27
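To put a number on point 2 in the comment above, here is a small sketch that counts the worst-case loop iterations of a binary search over `n` sorted elements. The helper name `max_search_steps` is hypothetical and not part of the original code; it assumes a search that halves the remaining range on every iteration.

```cpp
#include <cstddef>

// Hypothetical helper: worst-case number of iterations of a binary search
// over n sorted elements, i.e. floor(log2(n)) + 1 for n >= 1 and 0 for n == 0.
std::size_t max_search_steps(std::size_t n) {
    std::size_t steps = 0;
    while (n > 0) {   // each iteration halves the remaining range
        n /= 2;
        ++steps;
    }
    return steps;
}
```

With these assumptions, `max_search_steps(1000000)` is 20 and `max_search_steps(250000)` is 18, so splitting a million elements across four ranks saves at most two comparisons per search. Scattering the array, by contrast, touches every element, which is consistent with the timings getting worse as processes are added.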
@VictorEijkhout Thank you. Wrapping the `BinarySearch` call in a for loop and moving the `MPI_Reduce` outside the function solved the problem. It goes down from 5 seconds on 1 process to 0.7 seconds on 128 processes :) — baha, May 29 '23 at 18:57
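The final code was not posted, so the following is only a sketch of the per-rank part of the fix described in the last two comments: a communication-free search function plus a loop that runs many searches on the local chunk. The names `find_local` and `count_local_hits` are hypothetical, and counting hits is just one possible way to combine results.

```cpp
#include <cstddef>
#include <vector>

// Binary search over the half-open range [0, chunk.size()) of a sorted chunk.
// Returns the local index of target, or -1 if it is not present.
long find_local(const std::vector<int>& chunk, int target) {
    std::size_t lower = 0;
    std::size_t upper = chunk.size();   // exclusive upper bound
    while (lower < upper) {
        std::size_t mid = lower + (upper - lower) / 2;
        if (chunk[mid] == target) return static_cast<long>(mid);
        if (chunk[mid] < target) lower = mid + 1;
        else upper = mid;
    }
    return -1;
}

// Hypothetical helper: run a whole batch of searches on this rank's chunk.
// No MPI call happens inside the loop, so communication cost is not paid
// once per search.
long count_local_hits(const std::vector<int>& chunk,
                      const std::vector<int>& queries) {
    long hits = 0;
    for (int q : queries) {
        if (find_local(chunk, q) >= 0) ++hits;
    }
    return hits;
}
```

In an MPI program, each rank would call `count_local_hits` on the chunk it received from `MPI_Scatter` and then combine the results once, after the loop, for example with `MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);`. Keeping the scatter and the array creation outside the timed region, as suggested earlier in the thread, leaves only the searches and that single reduction between the two `MPI_Wtime` calls.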
