We are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run with 8 threads. Currently, we are using a structure like this:

#pragma omp parallel num_threads(2)
{
   if (omp_get_thread_num() == 0){
     cblas_dgemm(...);
   }else {
     cblas_dgemm(...);
   }
}

Here is the issue:

At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. We expect those threads to call the cblas_dgemm functions in parallel, and we expect new threads to be spawned inside each cblas_dgemm call.

To set the number of threads used inside each cblas_dgemm call, we set the corresponding environment variable:

setenv OPENBLAS_NUM_THREADS 8

However, this does not seem to work. When we time the two parallel calls, their runtimes are equal to each other, but they also match the runtime of a single cblas_dgemm call made without nested parallelism and with OPENBLAS_NUM_THREADS set to 1.

What is going wrong, and how can we get the desired behavior? Is there any way to find out the number of threads used inside the cblas_dgemm function?

Thank you very much for your time and help

sanaz

1 Answer


The mechanism you are trying to use is called "nesting", that is, creating a new parallel region inside an outer parallel region that is already active. While most implementations support nesting, it is disabled by default. Try setting the environment variable OMP_NESTED=true before running the program, or call omp_set_nested(true) before the first OpenMP directive in your code.
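One way to check whether nesting actually took effect is to ask the OpenMP runtime for the inner team size from inside the nested region. The following is a minimal sketch, not taken from the original post: the function name inner_team_size is hypothetical, and the extra call to omp_set_max_active_levels(2) is an assumption added because omp_set_nested is deprecated as of OpenMP 5.0. The code is guarded with _OPENMP so that it also builds (and reports 1) without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of threads in the inner team
 * that belongs to outer thread 0. A result of 1 means the inner region
 * was serialized, i.e. nesting is not active. */
int inner_team_size(int outer, int inner)
{
    int observed = 1;
#ifdef _OPENMP
    omp_set_nested(1);             /* as suggested above; deprecated since OpenMP 5.0 */
    omp_set_max_active_levels(2);  /* assumption: newer way to allow two active levels */

    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            /* One thread per inner team runs this block. */
            #pragma omp single
            {
                /* Only the team spawned by outer thread 0 records its size,
                 * so there is a single writer to 'observed'. */
                if (omp_get_ancestor_thread_num(1) == 0) {
                    observed = omp_get_num_threads();
                }
            }
        }
    }
#else
    (void)outer;
    (void)inner;
#endif
    return observed;
}
```

Calling inner_team_size(2, 8) from main and printing the result would show 8 if the runtime granted the full nested team, and 1 if the inner region was serialized. Note that this only reports OpenMP's view of the teams. Whether the threads inside cblas_dgemm follow it depends on how OpenBLAS was built (with OpenMP or with its own pthreads backend), which this sketch cannot determine. OpenBLAS also declares openblas_get_num_threads() in its cblas.h, which may be a more direct check.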

I would also change the above code to read like this:

#pragma omp parallel num_threads(2)
{
#pragma omp sections
    {
#pragma omp section
        {
            cblas_dgemm(...);
        }
#pragma omp section
        {
            cblas_dgemm(...);
        }
    }
}

That way, the code also produces the correct result when only one thread is available, because the two dgemm calls are then executed one after the other. With only one thread, your version would still run but would skip the second dgemm call.
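The difference between the two structures can be made visible with a small counting experiment. This is only a sketch under stated assumptions: fake_dgemm is a hypothetical stand-in for cblas_dgemm(...) that merely counts how often it is called, and the function names calls_with_if_else and calls_with_sections are made up for illustration. A serial fallback for omp_get_thread_num is included so the code also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical stand-in for cblas_dgemm(...): it only counts invocations. */
static void fake_dgemm(int *calls)
{
    #pragma omp atomic
    *calls += 1;
}

/* Structure from the question: work is selected by thread number. */
int calls_with_if_else(int nthreads)
{
    int calls = 0;
    #pragma omp parallel num_threads(nthreads) shared(calls)
    {
        if (omp_get_thread_num() == 0) {
            fake_dgemm(&calls);
        } else {
            fake_dgemm(&calls);  /* never reached when the team has one thread */
        }
    }
    return calls;
}

/* Structure from the answer: each section runs exactly once,
 * regardless of how many threads are in the team. */
int calls_with_sections(int nthreads)
{
    int calls = 0;
    #pragma omp parallel num_threads(nthreads) shared(calls)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                fake_dgemm(&calls);
            }
            #pragma omp section
            {
                fake_dgemm(&calls);
            }
        }
    }
    return calls;
}
```

With a single thread, calls_with_if_else(1) performs one call while calls_with_sections(1) performs both. With two threads, both variants perform two calls, provided the runtime actually grants the second thread.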

Michael Klemm
  • Thank you very much, Michael. We are using omp_set_nested(true). I have added print statements that show the OpenMP thread IDs inside each section. I have also timed each section and compared it with the overall timing. Our current OpenMP construct seems to work as intended. We are just unsure about the threads inside the OpenBLAS functions. – sanaz Mar 12 '19 at 16:20