
I'm trying to run the following code to see how OpenMP threads are managed over nested loops, where the inner and outer loops are implemented separately in a function and its caller.

Each loop is annotated with `#pragma omp parallel for`, and I'm assuming the pragma on the inner loop is ignored.

To see this, I printed the thread number in each loop.

What I see is the following: the thread id in the inner loop is always zero, not the thread number of the caller. Why does this happen?

Calling 0 from 0
Calling 2 from 1
Calling 6 from 4
Calling 8 from 6
Calling 4 from 2
Calling 7 from 5
Calling 5 from 3
    Calling 0 from 0  // Expecting 3
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
    Calling 0 from 0
    Calling 0 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
Calling 9 from 7
    Calling 1 from 0 // Expecting 7
    Calling 2 from 0
    Calling 3 from 0
    Calling 0 from 0
Calling 3 from 1
    Calling 0 from 0 // Expecting 1
    Calling 1 from 0
    Calling 2 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
    Calling 3 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
Calling 1 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 0
    Calling 3 from 0
#include <vector>
#include <omp.h>
#include <iostream>
#include <cstdio>
#include <limits>
#include <cstdint>
#include <cinttypes>

using namespace std;

const size_t  kM = 4;

struct Mat
{
 int elem[kM];

 Mat(const Mat& copy)
 {
  for (size_t i = 0; i<kM; ++i)
   this->elem[i] = copy.elem[i];
 }
 Mat()
 {
  for (size_t i = 0; i<kM; ++i)
    elem[i] = 0;
 }

 void do_mat(Mat& m)
 {
  // inner parallel region: nested inside the caller's parallel region
  #pragma omp parallel for
  for (int i = 0; i<kM; ++i)
  {
    printf(" \tCalling %d from %d\n", i, omp_get_thread_num());
    elem[i] += m.elem[i];
  }
 }
};

int main ()
{
  const int kN = 10;
  vector<Mat> matrices(kN);

  Mat m;
  // outer parallel region over the kN matrices
  #pragma omp parallel for
  for (int i = 0; i < kN; i++)
  {
    int tid = omp_get_thread_num();
    printf("Calling %d from %d\n", i, tid);
    matrices[i].do_mat(m);
  }

  return 0;
}

2 Answers


I'm not sure I understand what it is that you expected, but the result you get is exactly what should be expected.

By default, OpenMP nested parallelism is disabled, meaning that each thread from the outer level that encounters a nested parallel region creates its own team of just 1 thread.

In your case, the outermost parallel region creates a team of 8 threads. Each of these reaches the innermost parallel region and creates a second-level team of 1 thread. Each of these second-level threads is ranked 0 in its own team, hence the 0s you see printed.
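
If what you actually want is the id of the calling outer thread, you can query it explicitly with `omp_get_ancestor_thread_num()` (OpenMP 3.0+) instead of `omp_get_thread_num()`, which only reports the rank within the innermost team. A minimal sketch of the idea (separate from your code):

#include <cstdio>
#include <omp.h>

int main()
{
  #pragma omp parallel num_threads(4)
  {
    #pragma omp parallel num_threads(2) // 1-thread teams while nesting is disabled
    {
      int inner = omp_get_thread_num();           // rank in the innermost team (always 0 here)
      int outer = omp_get_ancestor_thread_num(1); // rank of this thread's ancestor at level 1 (the outer team)
      printf("inner id %d, outer id %d\n", inner, outer);
    }
  }
  return 0;
}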

With the very same code, compiled with g++ 9.3.0, and the two environment variables OMP_NUM_THREADS and OMP_NESTED set, I get the following:

OMP_NUM_THREADS="2,3" OMP_NESTED=true ./a.out 
Calling 0 from 0
Calling 5 from 1
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 1
    Calling 0 from 0
    Calling 1 from 0
    Calling 3 from 2
    Calling 3 from 2
    Calling 2 from 1
Calling 6 from 1
Calling 1 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 3 from 2
    Calling 2 from 1
Calling 2 from 0
    Calling 0 from 0
    Calling 1 from 0
    Calling 2 from 1
    Calling 3 from 2
    Calling 0 from 0
    Calling 1 from 0
    Calling 3 from 2
    Calling 2 from 1
Calling 3 from 0
Calling 7 from 1
    Calling 0 from 0
    Calling 3 from 2
    Calling 2 from 1
    Calling 3 from 2
    Calling 0 from 0
    Calling 1 from 0
    Calling 1 from 0
    Calling 2 from 1
Calling 4 from 0
Calling 8 from 1
    Calling 0 from 0
    Calling 3 from 2
    Calling 2 from 1
    Calling 2 from 1
    Calling 0 from 0
    Calling 1 from 0
    Calling 3 from 2
    Calling 1 from 0
Calling 9 from 1
    Calling 2 from 1
    Calling 0 from 0
    Calling 1 from 0
    Calling 3 from 2

Maybe that corresponds better to what you expected to see?
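
For completeness, nesting can also be enabled from within the program rather than through the environment. Here is a sketch assuming OpenMP 3.0+ (`omp_set_nested()` was deprecated in OpenMP 5.0 in favor of `omp_set_max_active_levels()`; on older runtimes you may need `omp_set_nested(1)` as well):

#include <cstdio>
#include <omp.h>

int main()
{
  omp_set_max_active_levels(2); // allow two active levels of parallelism

  #pragma omp parallel for num_threads(2)
  for (int i = 0; i < 4; i++)
  {
    #pragma omp parallel for num_threads(3)
    for (int j = 0; j < 4; j++)
      printf("outer %d (thread %d), inner %d (thread %d)\n",
             i, omp_get_ancestor_thread_num(1), j, omp_get_thread_num());
  }
  return 0;
}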

Gilles
  • What is the meaning of the number printed in the inner loop? In my case, everything shows as 0, yet the outer loop is performed in parallel by 8 threads (numbered 0-7). I doubt the inner loop is really processed by the same thread, numbered 0. – user9414424 Aug 09 '20 at 13:57
  • Probably your misunderstanding comes from the fact that `omp_get_thread_num()` returns the thread id within the most recently encountered team (the innermost one), not necessarily the outermost one. The 0s you see are perfectly in line with that, since all threads are master threads of their own second-level parallel region: 8 different teams of 1 thread each. In my version, 2 second-level teams of 3 threads each are created, which is why you see ids spanning from 0 to 2 there. Does that make sense now? – Gilles Aug 09 '20 at 14:20
  • Ah, I get the point now. Thank you for your comment. When I removed the pragma from the inner loop, I got what I expected. Does this mean that using `omp parallel for` (without any specific options) on both the inner and outer loops creates extra overhead in the inner loop? – user9414424 Aug 09 '20 at 16:03

Unless you provide special options, OpenMP does not enable nested parallelism: the work of a nested parallel region is not split any further, so each outer thread runs the inner loop with a team of a single thread.

You can refer to this StackOverflow question for suggestions (e.g. using `collapse` in OpenMP 3.0+).
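
Note that `collapse` only applies when the loops are lexically nested under a single pragma, so the two loops from the question would first have to be merged into one place. A sketch of that reshaping, assuming per-element work as simple as in the question:

#include <cstdio>
#include <omp.h>

int main()
{
  const int kN = 10, kM = 4;
  int elem[kN][kM] = {};

  // collapse(2) fuses both loops into a single iteration space of
  // kN*kM iterations, divided among one team of threads
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < kN; i++)
    for (int j = 0; j < kM; j++)
    {
      printf("i=%d j=%d from thread %d\n", i, j, omp_get_thread_num());
      elem[i][j] += 1;
    }
  return 0;
}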

Alexey S. Larionov