
I am studying OpenMP and I have some questions that I hope will clear up my understanding.

I have a small example of a matrix multiplication A*B where A, B, C are global variables. I know how we can parallelize the for loops one at a time, or both together with collapse, but my question is:

In which loop, if I use #pragma omp for, do I need to add a critical section at check1 (since C is a global variable)? And in which loop should I use nowait to avoid the barrier, because I know #pragma omp for adds one automatically at the end of the loop? Here is the serial code, followed by my approach:

int j, k, sum;
for(int i = 0; i < N; i++)            // loop1
    for(j = 0; j < N; j++) {          // loop2
        for(k = sum = 0; k < N; k++)  // loop3
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;                // check1
    }

My approach:

int j, k, sum;
#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static) nowait // **one**
    for(int i = 0; i < N; i++)            // loop1
        for(j = 0; j < N; j++) {          // loop2
            for(k = sum = 0; k < N; k++)  // loop3
                sum += A[i][k] * B[k][j];
            #pragma omp critical          // **two**
            C[i][j] = sum;                // check1
        }
}
  1. one: I put "nowait" there because the code runs faster with it; I don't know the reason, or whether I am making the right decision.
  2. two: I use a critical section, thinking of how I would have built this with raw threads.

So let's say that this is right: what about parallelizing the second for loop, or the third? Do I need those things there or not? If someone can explain when I need to add a critical section or nowait if I parallelize these nested for loops one at a time, I would appreciate it!


1 Answer


In your example you need neither the nowait nor the critical:

#pragma omp parallel for schedule(static) num_threads(4) // **one**
for(int i = 0; i < N; i++)        // loop1
    for(j = 0; j < N; j++)        // loop2 (j and k are still shared here, see below)
        for(k = 0; k < N; k++)    // loop3
            C[i][j] += A[i][k] * B[k][j];

There is no race condition during the updates to the global matrix C: each thread updates a different position of that matrix. You do, however, have a different race condition, namely on the variables j and k (and sum, in your version), since they are declared outside the loops and are therefore shared among the threads. To fix this race condition, just make them private, for instance by declaring them inside the loops:

#pragma omp parallel for schedule(static) num_threads(4) // **one**
for(int i = 0; i < N; i++)            // loop1
    for(int j = 0; j < N; j++)        // loop2
        for(int k = 0; k < N; k++)    // loop3
            C[i][j] += A[i][k] * B[k][j];
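Equivalently, if you prefer to keep the declarations outside the loops, a minimal sketch using the private clause (same loop nest as above) would be:

int j, k;
#pragma omp parallel for schedule(static) num_threads(4) private(j, k)
for(int i = 0; i < N; i++)
    for(j = 0; j < N; j++)        // each thread now has its own copy of j ...
        for(k = 0; k < N; k++)    // ... and its own copy of k
            C[i][j] += A[i][k] * B[k][j];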

one: I put "nowait" there because the code runs faster with it; I don't know the reason, or whether I am making the right decision.

Well, you should not just blindly add a nowait without a good reason for it. Here, though, the loop's barrier can be safely removed because 1) there is no code between the end of the parallel for and the end of the parallel region, and 2) the parallel region itself ends with an implicit barrier anyway:

#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static) nowait // **one**
     ...
    // no code here
} // <-- implicit barrier
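For contrast, here is a hypothetical sketch (the arrays X and Y are illustrative, not from the question) of a situation where nowait actually pays off: a second loop follows inside the parallel region, and its iterations do not depend on the first loop having finished:

#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static) nowait // threads fall through without waiting
    for(int i = 0; i < N; i++)
        X[i] = 2 * i;                       // hypothetical independent array X

    #pragma omp for schedule(static)        // safe: Y[i] never reads X
    for(int i = 0; i < N; i++)
        Y[i] = 3 * i;                       // hypothetical independent array Y
} // <-- implicit barrier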

Notwithstanding, in your case you can simply merge both pragmas into one and drop the nowait clause:

#pragma omp parallel for schedule(static) num_threads(4)

So let's say that this is right: what about parallelizing the second for loop, or the third? Do I need those things or not?

The generic answer is: it depends. It depends on too many factors, but typically you should start with the outermost loop, because it produces the tasks with the highest granularity (each thread gets the largest chunk of work per scheduling decision); again, this varies from case to case. Nonetheless, in your concrete example, parallelizing the outermost loop is enough.

You can also try parallelizing the nested loops to see whether you gain any performance. For example, test parallelizing only the outermost loop against parallelizing the two outermost loops, and check whether you obtain any speedup.
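As a minimal sketch of the two-loop variant, you can use the collapse clause that the question already mentions; note that the i and j loops must stay perfectly nested, so sum is declared inside the loop body to keep it private:

#pragma omp parallel for schedule(static) num_threads(4) collapse(2)
for(int i = 0; i < N; i++)
    for(int j = 0; j < N; j++) {
        int sum = 0;                  // declared inside the loop, hence private
        for(int k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }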

dreamcrash