
Assume I have these loops:

#pragma omp parallel for
for (int i = 0; i < 100; ++i)
{
    // some big code here
    #pragma omp parallel for
    for (int j = 0; j < 200; j++)
    {
        // some small code here
    }
}

Which loop runs in parallel? Which one is the best to run in parallel?

The main point here is:

1- If the i-loop runs in parallel, then since there is some big code in it, there is a good chance of CPU cache hits on every iteration of the loop.

2- If the j-loop runs in parallel, then since there is not much code in it, it probably doesn't benefit from the CPU cache, and I lose the chance to run the big code in parallel.

I don't know how OpenMP runs these for loops in parallel, so how can I optimize them?

My code should run on Windows (Visual Studio) and ARM Linux.

mans
    I could sit down and write some words for you, maybe even create a good answer. But you'll learn a lot more if you experiment yourself, try some variations of problem size, parallelisation strategy, schedule clause. Go on, roll up your sleeves and start coding, you know you want to. – High Performance Mark May 07 '15 at 12:41
  • @HighPerformanceMark: Thank you for your suggestion. I can write some sample code and test, but then the result is valid for only one case and I can not learn the whole concept. I like to know the concept and then use my case ( or sample cases) to use the concept and optimize the code. I think, I can not get the concept from running a sample code, can I? – mans May 07 '15 at 13:54
  • Just use "omp for" in the inner loop. You are already in a parallel region. Beware of nested OpenMP implementation hazards. – Jeff Hammond May 08 '15 at 11:24

2 Answers


Without enabling nesting (environment variable OMP_NESTED=true), only the outer loop will run in parallel.

If you enable nesting, both loops will run in parallel, but you will probably create too many threads.

You could use omp parallel for on the outer loop and, for the inner loop, use tasks that each group a block of iterations, for example:

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    //big code here

    int blocksize = 200 / omp_get_num_threads();
    if (blocksize == 0) blocksize = 1;
    for (int j = 0; j < 200; j += blocksize) {
        int mystart = j;
        int myend = (j + blocksize - 1 < 199) ? j + blocksize - 1 : 199;
        #pragma omp task firstprivate(mystart,myend)
        {
            //small code here, running iterations mystart..myend
        }
    }
    #pragma omp taskwait
}

If you consider using SIMD in the inner loop, it can be written quite similarly to what you had:

#pragma omp parallel for
for (int i = 0; i<100; i++) {
    //big code here
    #pragma omp simd
    for (int j = 0; j<200; j++) {
        //small code here
    }   
}

But this last option is quite specific: it essentially forces the compiler to vectorize the loop.

More info on the topic: in https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40 you will find an example that uses #pragma omp parallel for simd. That means the loop is parallelized across threads and each thread runs its share of the iteration space with vectorization applied. Used on the inner loop here, this would still require enabling nesting of parallel regions (OMP_NESTED), and depending on the runtime implementation it can create multiple teams of threads, up to one per thread of the outer loop.

Raul
  • Depending on the runtime, it might be a better idea to use "taskgroup" construct for the nested tasks. – Raul Nov 17 '15 at 14:17

I agree that experimentation is a great way to learn about parallel programming, and you should try multiple combinations (inner only, outer only, both, something else?) to see what is the best for your code. The rest of my answer will hopefully give you a hint as to why the fastest way is fastest.

Nesting parallel regions can be done, but it is typically not what you want. Consider this question for a similar discussion.

When choosing which loop to parallelize, a common theme is to parallelize the outermost loop first for multicore and the innermost loop first for SIMD. There are of course caveats. Not all loops can be parallelized, in which case you should move on to the next loop. Additionally, locality, load balancing, and false sharing may change which loop is optimal.

user2548418