What I'm trying to do: the first part of the program runs one function on a single thread while the other threads start on a second function (which contains three loops); once the single thread finishes the first function, it joins the others to help with the loops.

I made the following code, which is wrong: every thread runs the second function, but only one thread should run it while all threads help with its loops.

void main(){

    /* code */

    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            std::cout << "Running func 1: " << omp_get_thread_num() << std::endl;
            func1();
            std::cout << "Finish func 1" << std::endl;
        }

        #pragma omp nowait

        std::cout << "Running func 2: " << omp_get_thread_num() << std::endl;
        func2();

        #pragma omp barrier
    } //close parallel

    /* more code */

} //close main



void func2(void){

    /* code to read file */

    #pragma omp parallel
    {
        for (int i=40; i<60; i++){
            #pragma omp for nowait
            for (int j=0; j<100; j++){
                /* code */
            }
        }

        #pragma omp for schedule(dynamic,1) nowait
        for (int i=0; i<40; i++){
            for (int j=0; j<100; j++){
                /* code */
            }
        }

        #pragma omp for schedule(dynamic)
        for (int i=60; i<100; i++){
            for (int j=0; j<100; j++){
                /* code */
            }
        }

        /* code to write file */

    } //close parallel

    #pragma omp barrier
} //close func2

My terminal shows:

Running func 1: 0
Running func 2: 1
Running func 2: 2
Running func 2: 3 
Finish func 1
Running func 2: 0 

Edit

Note: func1 should only be executed on one thread.

func2 has been split into three for loops because the first one consumes more time than all the others together. If I use only one for loop, all the other threads finish and some keep running alone. This way, the hardest iterations are calculated first.

Using the code suggested by Jim Cownie, func1 and func2 run in parallel; however, either the threads run func2 twice, or only one thread runs it alone without help from the others.

Any ideas how I can do this, even if it means using task or sections?

Guus
  • With omp single nowait you're explicitly telling the other threads to continue execution before the single completes. The omp barrier at the end of the parallel region is redundant and has no effect. It looks like your example works exactly as you instructed. Maybe if you hadn't enabled nested parallelism and now try doing so, func2 may take long enough that func1 completes before 3 threads complete func2, but you would still have no guarantee on the order of completion. Your entire last for() loop is executed by each thread. – tim18 Oct 18 '19 at 09:30

1 Answer


Your code has a number of issues:

  1. There is no such OpenMP directive as #pragma omp nowait, so you may not even be compiling with OpenMP enabled (when it is enabled, you should get an error message; e.g., see https://godbolt.org/z/EbYV6h ).
  2. There is never any need for a #pragma omp barrier immediately before the end of a parallel region (since the master thread which will execute the next serial region cannot leave until all threads have also finished executing in the parallel region.)

I don't understand why you want to use nested parallelism. You are already executing func2() in parallel, so any nesting here will lead to over-subscription.

You can achieve what you want either like this:

#pragma omp parallel
{
#pragma omp single nowait
    func1();
    func2();
}

void func2()
{
#pragma omp for schedule(dynamic) nowait
    for (...)
        ... etc ...
}

Or, by using tasks and taskloops, which is potentially a cleaner way of expressing it.

Using tasking (and after your clarification that you only want function2 executed once; I was reading what the code said, since that's easier than mind-reading!), something like this works:

#include <unistd.h>
#include <stdio.h>
#include <omp.h>

void function1()
{
  fprintf(stderr,"%d: entering function1\n", omp_get_thread_num());
  sleep(1);
  fprintf(stderr,"%d: leaving function1\n", omp_get_thread_num());
}

void function2()
{
  fprintf(stderr,"%d: entering function2\n", omp_get_thread_num());
#pragma omp taskloop grainsize(1)                                                                                   
  for (int i=0; i<10; i++)
    {
      fprintf(stderr,"%d: starting iteration %d\n",
                     omp_get_thread_num(),i);
      sleep(1);
      fprintf(stderr,"%d: finishing iteration %d\n",
                     omp_get_thread_num(),i);
    }
  fprintf(stderr,"%d: leaving function2\n", omp_get_thread_num());
}

int main()
{
#pragma omp parallel
  {
#pragma omp single
    {
      fprintf(stderr,"Executing with %d threads\n",
                      omp_get_num_threads());
#pragma omp task
      {
        function1();
      }
#pragma omp task
      {
        function2();
      }
    }
  }
}

Here's an execution on four threads; of course, other interleavings are possible.

OMP_NUM_THREADS=4 ./a.out
Executing with 4 threads
3: entering function2
2: entering function1
0: starting iteration 0
1: starting iteration 1
3: starting iteration 9
1: finishing iteration 1
3: finishing iteration 9
0: finishing iteration 0
3: starting iteration 8
1: starting iteration 2
2: leaving function1
0: starting iteration 3
2: starting iteration 4
3: finishing iteration 8
1: finishing iteration 2
3: starting iteration 7
0: finishing iteration 3
0: starting iteration 6
2: finishing iteration 4
1: starting iteration 5
0: finishing iteration 6
3: finishing iteration 7
1: finishing iteration 5
3: leaving function2

You can see that only one thread executes each of function[12], and that the loop iterations are shared over all of the threads.

Jim Cownie
  • I removed what you said from the code. However, it continues to run func2 three times (I have 4 threads, one runs func1 and the other runs func2). I added some important points, before loops, func2 reads files and after loops it writes to disk. Any tips on how to do this with tasks or some other way to improve? – Guus Oct 18 '19 at 22:16
  • Updated yesterday to show a tasking implementation. – Jim Cownie Oct 23 '19 at 08:17
  • Thanks so much Jim Cownie , that's exactly what I needed! I did not know "taskloop" and "grainsize", I will study about it.... After two weeks trying to solve the problem you were able to help me. thx! – Guus Oct 23 '19 at 18:48
  • taskloop is cleaner syntax (and allows recursive task creation), but you could just run the loop in function2 using omp task wrapped around the body instead. (If you have an old OpenMP implementation that predates taskloop, that is how to do the same thing there.) – Jim Cownie Oct 24 '19 at 08:21