Parallelizing divide step and merge step in mergesort algorithm with OpenMP

Question

We wrote a code that parallelizes the divide phase of the mergesort algorithm. I.e. every recursive call is assigned to a thread but our teacher was disappointed because he said we should also parallelize the merge function. I have researched how one can do this and I found these (algorithm 3) lecture notes which changes the complexity of the merge function from O(n) to O(n(log(n)) but which one can now parallelize. I wanted to ask for suggestions how to design the code for the fastest result possible:

(both should compile with g++ /there might be some warnings if one uses pedantic flag but since both yield the correct result one could ignore them)

Old code:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>

#include <iostream>
#include <algorithm>

#include <cstdlib>
#include <cstdio>

#include <cmath>
#include <ctime>
#include <cstring>
#include <omp.h>
// Constants.h
#if !defined(MYLIB_CONSTANTS_H)
#define MYLIB_CONSTANTS_H 1

const int CUTOFF =11;

#endif


/**
  * helper routine: check if array is sorted correctly
  */
bool isSorted(int ref[], int data[], const size_t size){
    std::sort(ref, ref + size);
    for (size_t idx = 0; idx < size; ++idx){
        if (ref[idx] != data[idx]) {
            return false;
        }
    }
    return true;
}

/**
  * sequential merge step (straight-forward implementation)
  */
void MsMergeSequential(int *out, int *in, long begin1, long end1, long begin2, long end2, long outBegin) {
    long left = begin1;
    long right = begin2;
    long idx = outBegin;
    
    while (left < end1 && right < end2) {
        if (in[left] <= in[right]) {
            out[idx] = in[left];    
            left++;
        } else {
            out[idx] = in[right];
            right++;
        }
        idx++;
        }
    
    while (left < end1) {
        out[idx] = in[left];
        left++, idx++;
    }
    while (right < end2) {
        out[idx] = in[right];
        right++, idx++;
    }
}
bool myfunc (long i , long j){return (i<j);}
/**
  * sequential MergeSort
  */
void MsSequential(int *array, int *tmp, bool inplace, long begin, long end) {
  if( end <= begin + CUTOFF -1){

    std::sort(array+begin,array + end, myfunc);
  }
  else  if (begin < (end - 1)) {
           long half =(begin+end) / 2;


            #pragma omp taskgroup
         {
           #pragma omp task shared(array) untied if(end-begin >= (1<<15))

             MsSequential(array, tmp, !inplace, begin, half);

             MsSequential(array, tmp, !inplace, half, end);
              }
 if (inplace){
            MsMergeSequential(array, tmp, begin, half, half, end, begin);
 } else {
            MsMergeSequential(tmp, array, begin, half, half, end, begin);
 }
        
    } else if (!inplace) {

        tmp[begin] = array[begin];
    }
}

/**
  * Serial MergeSort
  */
void MsSerial(int *array, int *tmp, const size_t size) {

    MsSequential(array, tmp, true, 0, size);
}

/**

/**
  * @brief program entry point
  */
int main(int argc, char* argv[]) {
    // variables to measure the elapsed time
    struct timeval t1, t2;
    double etime;

    // expect one command line arguments: array size
    if (argc != 2) {
        printf("Usage: MergeSort.exe <array size> \n");
        printf("\n");
        return EXIT_FAILURE;
    }
    else {
        const size_t stSize = strtol(argv[1], NULL, 10);
        int *data = (int*) malloc(stSize * sizeof(int));
        int *tmp = (int*) malloc(stSize * sizeof(int));     
        int *ref = (int*) malloc(stSize * sizeof(int));
        printf("Initialization...\n");

        srand(95);

        #pragma omp parallel for num_threads(100) schedule(static)
        for (size_t idx = 0; idx < stSize; ++idx){
            data[idx] = (int) (stSize * (double(rand()) / RAND_MAX));
        }
        std::copy(data, data + stSize, ref);

        double dSize = (stSize * sizeof(int)) / 1024 / 1024;
        printf("Sorting %zu elements of type int (%f MiB)...\n", stSize, dSize);

        gettimeofday(&t1, NULL);
        #pragma omp parallel num_threads(80) 
        {
        #pragma omp single
        {
        MsSerial(data, tmp, stSize);
        }
        }
        gettimeofday(&t2, NULL);
        etime = (t2.tv_sec - t1.tv_sec) * 1000 + (t2.tv_usec - t1.tv_usec) / 1000;
        etime = etime / 1000;

        printf("done, took %f sec. Verification...", etime);
        if (isSorted(ref, data, stSize)) {
            printf(" successful.\n");
        }
        else {
            printf(" FAILED.\n");
        }

        free(data);
        //delete[] data;
        free(tmp);
        //delete[] tmp;
        free(ref);
        //delete[] ref;
    }

    return EXIT_SUCCESS;
}

New code - merge is parallelizable:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>

#include <iostream>
#include <algorithm>

#include <cstdlib>
#include <cstdio>

#include <cmath>
#include <ctime>
#include <cstring>
#include <omp.h>
// Constants.h
#if !defined(MYLIB_CONSTANTS_H)
#define MYLIB_CONSTANTS_H 1



#endif


//Takes a sorted list of size n and a value, puts the value in one of n+1 possible positions,
//if value is same to an element of the list take the position before the first occurence of the same element

int binarysearchfindlowerrank(int *in,int n,int value,int projection){

    int* array= in+projection;
    int L=0;
    int R=n;
    while(R-L>1){
        int middle = (R+L)/2;
        if(array[middle]==value){
            while(array[middle]==value&&middle>0){
                middle=middle-1;
            }
            if(middle==0&&array[middle]>=value){
                return 0;
            }
            else{
            return middle+1;
            }
        }
        if(array[middle]<value){
            L=middle;
        }
        if(array[middle]>value){
            R=middle;
        }
    }
    if(n==1){
        if(array[0]>=value){
            return 0;
        }
        else return 1;
    }
    if(L==0&&array[L]>value){
        return 0;
    }
    if(R==n && array[R-1]< value){
        return n;
    }
    if(R==n&& array[R-1]>=value){
        return R-1;
    }
    if(array[R]<value){
        return R+1;
    }
    if(array[L]<value){
        return R;
    }
    return L;
}


//Takes a sorted list of size n and a value, puts the value in one of n+1 possible positions,
//if value is same to an element of the list take the position after the last occurence of the same element


int binarysearchfinduperrank(int *in,int n,int value, int projection){

    int* array= in+projection;
    int L=0;
    int R=n;
    while(R-L>1){
        int middle = (R+L)/2;
        if(array[middle]==value){
            while(array[middle]==value&&middle<n){
                middle=middle+1;
            }
            return middle;
        }
        if(array[middle]<value){
            L=middle;
        }
        if(array[middle]>value){
            R=middle;
        }
    }
     if(n==1){
         if(array[0]> value){
             return 0;
        }
        else{
            return 1;
        }
    }
    if(L==0&&array[L]>value){
        return 0;
    }
    if(R==n && array[R-1]<= value){
        return n;
    }
    if(R==n&& array[R-1]>value){
        return R-1;
    }
    if(array[R]<=value){
        return R+1;
    }
    if(array[L]<=value){
        return R;
    }
    return L;
}

/**
  * helper routine: check if array is sorted correctly
  */
bool isSorted(int ref[], int data[], const size_t size){
    std::sort(ref, ref + size);
    for (size_t idx = 0; idx < size; ++idx){
        if (ref[idx] != data[idx]) {
            printf("\nFalscher Index:%d\n",idx);
            return false;
        }
    }
    return true;
}

/**
  * sequential merge step (straight-forward implementation)
  */
void MsMergeParallelized(int *out, int *in, long begin1, long end1, long begin2, long end2, long outBegin,int *data,int *tmp) {
    if(begin1==end2){
        out[begin1]=in[begin1];
    }
    if(begin1==begin2||begin2==end2){
        out[begin1+binarysearchfinduperrank(in,1,in[end2],begin1)]=in[end2];
        out[begin1+binarysearchfindlowerrank(in,1,in[begin1],end2)]=in[begin1];
    }
    else{
        for(int i=0;i<(end2-begin2);i++){
            out[begin1+i+binarysearchfinduperrank(in,(end1-begin1),in[begin2+i],begin1)]=in[begin2+i];
        }
        for(int i=0;i<(end1-begin1);i++){
            out[begin1+i+binarysearchfindlowerrank(in,(end2-begin2),in[begin1+i],begin2)]=in[begin1+i];
        }
    }
}
bool myfunc (long i , long j){return (i<j);}
/**
  * sequential MergeSort
  */
void MsParallelized(int *array, int *tmp, bool inplace, long begin, long end) {
  if (begin < (end - 1)) {
        long half =(begin+end) / 2;
        MsParallelized(array, tmp, !inplace, begin, half);
        MsParallelized(array, tmp, !inplace, half, end);
        if (inplace){
            MsMergeParallelized(array, tmp, begin, half, half, end, begin,array,tmp);
        }
        else {
            MsMergeParallelized(tmp, array, begin, half, half, end, begin,array,tmp);
        }
    }
    else if (!inplace) {
        tmp[begin] = array[begin];
    }
}

/**
  * Serial MergeSort
  */
void MsParallel(int *array, int *tmp, const size_t size) {

    MsParallelized(array, tmp, true, 0, size);
}

/**

/**
  * @brief program entry point
  */
int main(int argc, char* argv[]) {
    // variables to measure the elapsed time
    struct timeval t1, t2;
    double etime;

    // expect one command line arguments: array size
    if (argc != 2) {
        printf("Usage: MergeSort.exe <array size> \n");
        printf("\n");
        return EXIT_FAILURE;
    }
    else {
        const size_t stSize = strtol(argv[1], NULL, 10);
        int *data = (int*) malloc(stSize * sizeof(int));
        int *tmp = (int*) malloc(stSize * sizeof(int));
        int *ref = (int*) malloc(stSize * sizeof(int));
        printf("Initialization...\n");

        srand(95);


        for (size_t idx = 0; idx < stSize; ++idx){
            data[idx] = (int) (stSize * (double(rand()) / RAND_MAX));
        }
        std::copy(data, data + stSize, ref);
        double dSize = (stSize * sizeof(int)) / 1024 / 1024;
        printf("Sorting %u elements of type int (%f MiB)...\n", stSize, dSize);
        gettimeofday(&t1, NULL);
        // Mergesort starts
        MsParallel(data, tmp, stSize);

        gettimeofday(&t2, NULL);
        etime = (t2.tv_sec - t1.tv_sec) * 1000 + (t2.tv_usec - t1.tv_usec) / 1000;
        etime = etime / 1000;
        printf("done, took %f sec. Verification...", etime);
        if (isSorted(ref, data, stSize)) {
            printf(" successful.\n");
        }
        else {
            printf(" FAILED.\n");
        }
        free(data);
        //delete[] data;
        free(tmp);
        //delete[] tmp;
        free(ref);
        //delete[] ref;
    }
    return EXIT_SUCCESS;
}

I think https://codereview.stackexchange.com is more suitable for such question. — 273K, Jul 23 '21 at 00:54
Parallel merges are not so good in practice because of the `log(n)` factor and because of the hidden constant in the complexity. Multiway merges can help to mitigate the overhead of the sequential merge. On huge arrays, merging may be bounded by the memory and thus the parallel algorithm will likely be barely faster... possibly slower than a good multiway merge because of the `log(n)` factor... — Jérôme Richard, Jul 23 '21 at 17:10
@rcgldr The best is to use a branchless implementation for multiways merges. However, this is not always possible, nor easy to do (it is at least Ok with native times and the default comparison operator). Note that the number of register of the ABI should not have a huge impact on the result (at least, not on a 4 way case, I think): modern processors use register renaming and have far more registers internally (>100) and can execute instructions in parallel in an out-of-order fashion. Branches are a mess because they are hard to predict and there is only one port for that on current x86 procs. — Jérôme Richard, Jul 23 '21 at 20:34
@rcgldr you can select the min of each pair of element recursively. To get the min of two number you can use *conditional moves*. Alternatively, I think you can conditionally move the index of a way if it is smaller than the previous one so far. Conditional moves do not require any branches on many platforms (including x86 and arm). Due to the instruction dependencies, it is probably wise to work on both side of the stream to merge (it provides more instruction parallelism). A 8-way is probably not faster in sequential but it may be the case in parallel due to the limited memory throughput. — Jérôme Richard, Jul 23 '21 at 23:52
@rcgldr I did not understand your point about switches. I wrote a short partial example [here](https://godbolt.org/z/adM6jMWse) using 4 ways. This implementation is probably not optimal, but it actually do not use any branches (at least is does not require any unpredictable branches). Does it help? — Jérôme Richard, Jul 25 '21 at 14:32
Why 4 pairs of register for a 4-way? The provided code use less registers. Besides this, yes, memory accesses can slow down the code because of the latency in a classic 1 side merge, but with a two side merge it should be better (up to 2x theoretically). I think part of them can be removed. The tests I did in the past on a 2-way 2-side merge (on a i5-9600KF) shows that removing branches results in a significant performance improvement. I did not tested this on a merge with more way since. — Jérôme Richard, Jul 25 '21 at 18:25
@JérômeRichard - My old code for this is C|C++, and the compiler (visual studio) converted the 4 pairs of pointers into registers for X86 X64 release build. I'll need to try your variation to compare results. My code checks for end of run each time it merges an element and updates the corresponding pointer. This shouldn't be an issue for branch prediction, since the code normally loops back for next element and branches away to another loop when 4 way drops to 3 way (and again for 2 way and copy rest of remaining run). — rcgldr, Jul 26 '21 at 21:01
@JérômeRichard - If interested in trying changes, a link to my old example of [4 way bottom up merge sort](https://stackoverflow.com/questions/34844613/34845789#34845789), towards the end of that answer. I deleted most of my prior comments since they aren't needed anymore. — rcgldr, Jul 26 '21 at 21:05

Parallelizing divide step and merge step in mergesort algorithm with OpenMP

0 Answers0