Xeon Phi asynchronous offload from host OpenMP parallel region

Question

I am using Intel's offload pragmas in host OpenMP code. The code looks as follows:

int s1 = f(a,b,c);

#pragma offload signal(s1) in (...) out(x:len)
{
    for (int i = 0; i < len; ++i)
    {
        x[i] = ...
    }   
}

#pragma omp parallel default(shared)
{
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }

    #pragma omp for schedule(dynamic) 
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}

#pragma offload wait(s1)
{
    /* code */
}

The code offloads the calculation of $x$ to the MIC, and the host keeps itself busy by running OpenMP work-sharing loops on the CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless, overall it pays off to offload the computation of $x$ to the MIC. One way I am trying to overcome this latency issue is as follows:

int s1 = f(a,b,c);

#pragma omp parallel default(shared)
{
    #pragma omp single nowait
    {
        #pragma offload signal(s1) in (...) out(x:len)
        {
            for (int i = 0; i < len; ++i)
            {
                x[i] = ...
            }   
        }

    }

    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }

    #pragma omp for schedule(dynamic) 
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}

#pragma offload wait(s1)
{
    /* code */
}

This new code assigns one thread to do the offload while the other OpenMP threads handle the work-sharing constructs. However, this code doesn't work. I get the following error message:

device 1 does not have a pending signal for wait(0x1)

The offload report points to the above piece of code as the main culprit. One temporary workaround is to use a constant as the signal, i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shed light on what is going wrong in my code?

Thanks

This is a duplicate of https://software.intel.com/en-us/forums/topic/509845 — Jeff Hammond, Aug 18 '14 at 13:49

score 1 · Answer 1 · answered Apr 25 '14 at 07:47

Let me complement Taylor's reply a bit.

The first offload indeed takes more time than subsequent offloads, because of the initialization that happens at that point. Taylor sketched some of the things going on there. You can avoid the dummy offload by setting the environment variable OFFLOAD_INIT=on_start. That should let the runtime system do all the initialization ahead of time. The overhead does not go away, but it moves from your first offload to the application initialization.

The problem with your second code snippet seems to be that your offloads target different devices. Signalling and waiting only works if the signal and wait happen for the same target device. Since you do not explicitly use the target(mic:0) clause with your offloads, chances are high that the runtime system selects different target devices.

One recommendation I would like to make is to not use plain integers for the signalling. Usually, the signal indicates that a certain buffer is ready. In these cases, it is good practice to use the buffer pointer as the signal handle, since it will be unique for concurrent offloads working with different buffers.
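
As a minimal sketch of the two points above (the same explicit device on both the signal and the wait, and the buffer pointer as the signal tag), something like the following could work. The function name fill_squares_overlapped, the device index mic:0, and the host-side summation are made up for illustration. The __INTEL_OFFLOAD guard is only there so the code also builds and runs on the host with a compiler that has no offload support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: fills x[0..n) with i*i on the coprocessor while the
 * host sums y[0..m). Returns the host-side sum. */
long fill_squares_overlapped(long *x, int n, const long *y, int m)
{
    long host_sum = 0;

    assert(x != NULL && y != NULL && n >= 0 && m >= 0);

    /* Asynchronous offload: the buffer pointer x doubles as the signal tag.
     * Device 0 is an assumption; use whichever device you really target. */
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) signal(x) in(n) out(x : length(n))
#endif
    {
        for (int i = 0; i < n; ++i)
            x[i] = (long)i * i;
    }

    /* Host work that can overlap with the offload when it runs asynchronously. */
    for (int j = 0; j < m; ++j)
        host_sum += y[j];

    /* Wait on the same device and the same tag that the signal used. */
#ifdef __INTEL_OFFLOAD
#pragma offload_wait target(mic:0) wait(x)
#endif

    return host_sum;
}
```

Calling it with two different buffers gives two distinct tags, so concurrent offloads cannot collide the way two equal integer tags could.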

Cheers, -michael

They are targeting the same device, i.e. mic:1, and this is done explicitly. — arbitUser1401, Apr 25 '14 at 15:53

score 0 · Answer 2 · answered Apr 24 '14 at 20:38

I can't comment on the 2nd code block. I have some observations about the first.

The first offload always takes longer since it also sets up the offload infrastructure. This includes things such as passing environment variables, copying over the MIC implementation of libiomp5, setting up the thread pool, etc.

The way to avoid this is to set up a dummy offload first, meaning one that doesn't really do anything and is not part of your computation block.
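
A dummy offload could look roughly like this. The function name warm_up_offload and the device index mic:0 are assumptions, and the flag is only there so the caller can see that the region ran. The __INTEL_OFFLOAD guard lets the same code build with a compiler that has no offload support, in which case the block simply runs on the host.

```c
/* Hypothetical warm-up helper: runs one trivial synchronous offload so that
 * the runtime pays its one-time initialization cost here instead of inside
 * the timed computation. Returns 1 once the region has executed. */
int warm_up_offload(void)
{
    int touched = 0;

    /* Device 0 is an assumption; warm up the device you offload to later. */
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) inout(touched)
#endif
    {
        touched = 1;
    }

    return touched;
}
```

Call it once during start-up, before the section you are timing.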

An excellent set of references on optimizing for the Xeon Phi coprocessor is under the training tab at software.intel.com/mic-developer.

Also take a look at software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture and software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.

Sorry about the long URLs, but Stack Overflow doesn't allow me to include more than two links as I'm new.

Thanks for your response. This whole code block is inside a sequential for loop, so the function is called in every iteration of the outer for loop. Initialization is done before the outer for loop. It takes 7 seconds in my case (due to a huge number of mallocs). After the first round of mallocs, every allocated area is "reused". I am neglecting these 7 seconds altogether. What I am trying to hide is the MIC offload inside the outer for loop, which takes 4 milliseconds each time. — arbitUser1401, Apr 25 '14 at 15:59
