
I'm trying to parallelize this piece of code, which searches for the max on a column. The problem is that the parallelized version runs slower than the serial one.

Probably the search for the pivot (the max on a column) is slower due to the synchronization on the maximum value and the index, right?

int i,j,t,k;
// Decrease the dimension by a factor of 1 and iterate each time
for (i=0, j=0; i < rwA, j < cwA; i++, j++) {
    int i_max = i; // max index set as i
    double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
    #pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
    for (t = i+1; t < rwA; t++) {
        if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
            matrixA_maxCw_value = matrixA[t][j];
            i_max = t;
        }
    }
    if (matrixA[i_max][j] == 0) {
        j++; //Check if there is a pivot in the column, if not pass to the next column
    }
    else {
        //Swap the rows, of A, L and P
        #pragma omp parallel for //OVERHEAD
        for (k = 0; k < cwA; k++) {
            swapRows(matrixA, i, k, i_max);
            swapRows(P, i, k, i_max);
            if(k < i) {
                swapRows(L, i, k, i_max);
            }
        }
        lupFactorization(matrixA,L,i,j,rwA);
    }
}

void swapRows(double **matrixA, int i, int j, int i_max) {
    double temp_val = matrixA[i][j];
    matrixA[i][j] = matrixA[i_max][j];
    matrixA[i_max][j] = temp_val;   
}

I do not want different code; I only want to know why this happens. On a matrix of dimension 1000x1000 the serial version takes 4.1 s and the parallelized version 4.28 s.

The same thing happens (the overhead is very small, but it is there) with the swap of the rows, which in theory can be done in parallel without problems. Why does this happen?

Fabio

1 Answer


There are at least two things wrong with your parallelization:

#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
    if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
        matrixA_maxCw_value = matrixA[t][j];
        i_max = t;
    }
}

With this reduction you get the biggest index out of all of them, but that does not mean it belongs to the max value. For instance, look at the following array:

[8, 7, 6, 5, 4, 3, 2, 1]

If you parallelize it with two threads, the first thread will have max=8 and index=0, and the second thread will have max=4 and index=4. After the reduction is done, the max will be 8 but the index will be 4, which is obviously wrong.

OpenMP's built-in reduction operators work on a single target value; in your case, however, the reduction has to take two values into account: the max and its array index. Since OpenMP 4.0 you can define your own reduction (i.e., a user-defined reduction).

You can have a look at a full example implementing such logic here.
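For reference, here is a minimal sketch of how such a user-defined reduction could look for this pivot search. The maxloc_t struct, the maxloc reduction name and the best variable are illustrative names, not part of your code, and this assumes a compiler with OpenMP 4.0 support:

// Illustrative pair type holding the current max value and its row index
typedef struct { double value; int index; } maxloc_t;

// User-defined reduction: keep whichever partial result has the larger value
#pragma omp declare reduction(maxloc : maxloc_t : \
        omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out)) \
        initializer(omp_priv = omp_orig)

// Inside the outer loop, replacing the original reduction:
maxloc_t best = { fabs(matrixA[i][j]), i };
#pragma omp parallel for reduction(maxloc : best)
for (t = i + 1; t < rwA; t++) {
    double v = fabs(matrixA[t][j]);
    if (v > best.value) {
        best.value = v;
        best.index = t;
    }
}
i_max = best.index;
matrixA_maxCw_value = best.value;

This way the value and its index travel together through the reduction, so the final index always matches the final max.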

The other issue is this part:

#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
    swapRows(matrixA, i, k, i_max);
    swapRows(P, i, k, i_max);
    if(k < i) {
        swapRows(L, i, k, i_max);
    }
}

You are swapping those elements in parallel, which leads to inconsistent state.

You need to solve those issues first, before analyzing why your code is not getting any speedup.

First correctness, then efficiency. But don't expect much speedup with the current implementation: the amount of computation performed in parallel is not enough to justify the overhead of the parallelization.

dreamcrash