
I'm running some tests with the simple code below. The problem is that on a four-core machine I only get 75% load: the fourth core sits idle. The code has an omp parallel, then an omp single inside which the thread generates a task. That task generates a number of grandchild tasks. The task waits in a taskwait until all of its children (grandchildren of the thread in the single region) finish, and the thread executing the single region waits on another taskwait until its direct descendant task finishes. The problem is that the thread executing the single region never executes any of the grandchild tasks. Given the blocksize I'm using, I'm creating thousands of tasks, so it is not a lack of available parallelism.

Am I misunderstanding OpenMP tasking? Is this related to taskwait only waiting for direct children? If so, how can I get the idle thread to pick up the available work? Imagine I wanted to create tasks with dependencies, as in OpenMP 4.0: then I would not be able to exploit all the available threads. The taskwait in the parent task would still be needed, since I would not want to release tasks that depend on it until all of its children have finished.

#include <iostream>
#include <cstdlib>

#include <omp.h>

using namespace std;

#define VECSIZE 200000000

float* A;
float* B;
float* C;

void LoopDo(int start, int end) {
    for (int i = start; i < end; i++)
    {
        C[i] += A[i]*B[i];
        A[i] *= (B[i]+C[i]);
        B[i] = C[i] + A[i];
        C[i] *= (A[i]*C[i]);
        C[i] += A[i]*B[i];
        C[i] += A[i]*B[i];
        // ... (more statements of the same kind, elided in the original post)
    }
}

void StartTasks(int bsize)
{
    int nthreads = omp_get_num_threads();
    cout << "bsize is: " << bsize << endl;
    cout << "nthreads is: " << nthreads << endl;
    #pragma omp task default(shared)
    {
        for (int i = 0; i < VECSIZE; i += bsize)
        {
            if (i + bsize > VECSIZE) bsize = VECSIZE - i;
            #pragma omp task default(shared) firstprivate(i,bsize)
            LoopDo(i,i+bsize);
        }
        cerr << "Task creation ended" << endl;
        #pragma omp taskwait
    }
    #pragma omp taskwait
}


int main(int argc, char** argv)
{
    A = (float*)malloc(VECSIZE*sizeof(float));
    B = (float*)malloc(VECSIZE*sizeof(float));
    C = (float*)malloc(VECSIZE*sizeof(float));
    int bsize = atoi(argv[1]);
    for (int i = 0; i < VECSIZE; i++)
    {
        A[i] = i; B[i] = i; C[i] = i;
    }
    #pragma omp parallel
    {
        #pragma omp single
        {
            StartTasks(bsize);
        } 
    }
    free(A);
    free(B);
    free(C);
    return 0;
}

EDIT:

I tested with ICC 15.0 and it employs all the cores of my machine, although ICC forks 5 threads instead of the 4 that GCC does; the fifth ICC thread remains idle.

EDIT 2: The following change, adding a loop with as many top-level tasks as threads, gets all threads fed with tasks. If there are fewer top-level tasks than nthreads, then in some executions the master thread won't execute any task and will remain idle as before. ICC, as before, generates a binary that uses all cores.

 for (int t = 0; t < nthreads; t++)
 {
    #pragma omp task default(shared)
    {
        for (int i = 0; i < VECSIZE; i += bsize)
        {
            if (i + bsize > VECSIZE) bsize = VECSIZE - i;
            #pragma omp task default(shared) firstprivate(i,bsize)
            LoopDo(i,i+bsize);
        }
        cerr << "Task creation ended" << endl;
        #pragma omp taskwait
    }
 }
 #pragma omp taskwait
Raul
  • I am not sure whether this is relevant, but is nested parallelism enabled with `omp_set_nested` or `OMP_NESTED`? – coincoin May 04 '15 at 12:04
  • With OMP_NESTED, although in this case nested parallelism is not needed: it only matters for nested parallel regions, not tasks. – Raul May 04 '15 at 17:58
