
If I use nested parallel for loops like this:

#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
    #pragma omp parallel for schedule(dynamic,1)
    for (int y = 0; y < y_max; ++y) {
        // parallelize this code here
    }
    // IMPORTANT: no code in here
}

is this equivalent to:

for (int x = 0; x < x_max; ++x) {
    #pragma omp parallel for schedule(dynamic,1)
    for (int y = 0; y < y_max; ++y) {
        // parallelize this code here
    }
    // IMPORTANT: no code in here
}

Is the outer parallel for doing anything other than creating a new task?

Scott Logan

2 Answers


If your compiler supports OpenMP 3.0, you can use the collapse clause:

#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (int x = 0; x < x_max; ++x) {
    for (int y = 0; y < y_max; ++y) {
        // parallelize this code here
    }
    // IMPORTANT: no code in here
}

If it doesn't (e.g. only OpenMP 2.5 is supported), there is a simple workaround:

#pragma omp parallel for schedule(dynamic,1)
for (int xy = 0; xy < x_max*y_max; ++xy) {
    int x = xy / y_max;
    int y = xy % y_max;
    //parallelize this code here
}

You could also enable nested parallelism with omp_set_nested(1); your nested omp parallel for code would then work, but that might not be the best idea.
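
For completeness, a minimal sketch of what that could look like (the num_threads values here are illustrative assumptions, not something from your code):

omp_set_nested(1);  // allow parallel regions nested inside parallel regions
                    // (declared in <omp.h>)

#pragma omp parallel for schedule(dynamic,1) num_threads(2)
for (int x = 0; x < x_max; ++x) {
    // each outer thread opens its own inner parallel region here
    #pragma omp parallel for schedule(dynamic,1) num_threads(2)
    for (int y = 0; y < y_max; ++y) {
        // work on (x, y)
    }
}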

By the way, why the dynamic scheduling? Is every loop iteration evaluated in non-constant time?

Hristo Iliev
  • I'm using VS2008, so I don't think I can use collapse. I thought about doing it the second way you mentioned, but was hoping not to have to change the code significantly. It's for a ray tracer, so some primary rays can take up to 10 times longer than others. – Scott Logan May 10 '12 at 20:56
  • Apparently even VS2010 supports OpenMP 2.0 only. – Hristo Iliev May 10 '12 at 21:05
  • Just be careful: integer division and modulo are relatively expensive operations. If the loop body does little work, the overhead may be significant (see the sketch after these comments). – Daniel Langr Mar 01 '16 at 08:26
  • Shouldn't `x` and `y` be marked `private` in the last example? – ars Jan 26 '17 at 07:11
  • @ars, both variables are declared inside the parallel region and are therefore predetermined to be `private`. Also, because the variables do not exist in the outer scope, adding `private(x,y)` will result in an error. – Hristo Iliev Jan 26 '17 at 15:16
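
As a rough illustration of the division/modulo concern raised above (just a sketch, not code from the answers; many compilers perform this rewrite automatically when `/` and `%` share the same divisor), the modulo can be replaced by a multiply and a subtract, leaving a single integer division per iteration:

#pragma omp parallel for schedule(dynamic,1)
for (int xy = 0; xy < x_max*y_max; ++xy) {
    int x = xy / y_max;        // one integer division per iteration
    int y = xy - x * y_max;    // modulo replaced by a multiply and a subtract
    // parallelize this code here
}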

NO.

The first #pragma omp parallel will create a team of parallel threads, and the second will then try to create, for each of the original threads, another team, i.e. a team of teams. However, on almost all existing implementations the second team consists of just one thread, so the second parallel region is essentially unused. Thus, your code is effectively equivalent to:

#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
    // only one x per thread
    for (int y = 0; y < y_max; ++y) { 
        // code here: each thread loops all y
    }
}

If you don't want that, but want to parallelise only the inner loop, you can do this:

#pragma omp parallel
for (int x = 0; x < x_max; ++x) {
    // each thread loops over all x
    #pragma omp for schedule(dynamic,1)
    for (int y = 0; y < y_max; ++y) {
        // code here, only one y per thread
    }
}

Walter
  • I see, I considered that, but it seems so counter-intuitive. So if I want a 'parallel for' over all the iterations, I should put the 'parallel for' on the inner loop? – Scott Logan May 10 '12 at 19:57
  • @Bunnit I don't know what you want, but I added to my answer. – Walter May 10 '12 at 20:06
  • @Bunnit I think my second solution should achieve at least the same speed as the second one of Hristo Iliev, but is conceptually much clearer. If the workload in the inner loop does not vary vastly between different values of `y`, the first solution is even preferable. In either case, you parallelise the double loop in the sense that each pair (`x`,`y`) is executed only once by the whole team of threads. – Walter May 11 '12 at 08:21