OpenACC 2.0 routine: data locality

Question

Take the following code, which calls a simple routine on the accelerator. The routine is compiled for the device using OpenACC 2.0's routine directive:

#include <iostream>

#pragma acc routine
int function(int *ARRAY,int multiplier){
        int sum=0;

        #pragma acc loop reduction(+:sum)
        for(int i=0; i<10; ++i){
                sum+=multiplier*ARRAY[i];
        }

        return sum;
}

int main(){
        int *ARRAY = new int[10];
        int multiplier = 5;
        int out;

        for(int i=0; i<10; i++){
                ARRAY[i] = 1;
        }

        #pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)

        #pragma acc parallel present(out,ARRAY[0:10],multiplier)
        if (function(ARRAY,multiplier) == 50){
                out = 1;
        }else{
                out = 0;
        }

        #pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)

        std::cout << out << std::endl;
}

How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?

score 1 · Answer 1 · answered Aug 10 '15 at 20:41

When your routine is called inside a device region (the parallel construct in your code), it is executed by threads on the device, and those threads only have access to data in device memory. The compiler may inline the function or emit a device-side function call. Either way, when the function is called from the device it receives the device copies of the data, because it essentially inherits the present data clause from the parallel region. If you still want to convince yourself that you are running on the device once inside the function, you can call acc_on_device, but that only tells you that the code is executing on the accelerator, not that you received a device pointer.
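As a rough illustration of the acc_on_device check, the sketch below wraps it in a seq routine and calls it from a single-gang parallel region. The names running_on_device and probe_device are made up for this example, and the _OPENACC guard is only there so the code still builds with a compiler that ignores the pragmas (in which case it simply reports the host).

```cpp
#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper: returns 1 when executing on an accelerator, 0 on the host.
#pragma acc routine seq
int running_on_device() {
#ifdef _OPENACC
    // acc_device_not_host matches any device other than the host.
    return acc_on_device(acc_device_not_host) ? 1 : 0;
#else
    return 0;  // built without OpenACC: everything runs on the host
#endif
}

// Hypothetical driver: launches one gang and asks the routine where it runs.
int probe_device() {
    int on_device = 0;
    #pragma acc parallel num_gangs(1) copy(on_device)
    {
        on_device = running_on_device();
    }
    return on_device;
}
```

Calling probe_device() from main and printing the result shows whether the region was actually offloaded. As noted, it says nothing about whether a given pointer argument is a device pointer.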

If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.

Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
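One possible way to avoid that race is sketched below, using an atomic write for the store to out. The names weighted_sum and check_sum are placeholders, and the routine is declared seq here to keep the example simple; adding num_gangs(1) to the parallel construct would be the other option mentioned above.

```cpp
// Sequential device routine: each calling thread computes the full sum.
#pragma acc routine seq
int weighted_sum(const int *array, int n, int multiplier) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += multiplier * array[i];
    }
    return sum;
}

// Returns 1 if the weighted sum equals `expected`, 0 otherwise.
int check_sum(const int *array, int n, int multiplier, int expected) {
    int out = 0;
    #pragma acc parallel copyin(array[0:n]) copy(out)
    {
        // Every gang runs this block redundantly and computes the same value.
        int matches = (weighted_sum(array, n, multiplier) == expected) ? 1 : 0;

        // The atomic write makes the concurrent stores to `out` well defined.
        #pragma acc atomic write
        out = matches;
    }
    return out;
}
```

For an array of ten ones and a multiplier of 5, as in the question, check_sum(ARRAY, 10, 5, 50) should yield 1.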

score 0 · Answer 2 · answered Mar 09 '15 at 03:59

Basically, when you use a "data" clause, the data is created in (or copied to) device memory, and the block of code defined with "acc routine" is then executed on the device. Note that host and device memory are not shared, unlike memory between threads in multi-threading (OpenMP). So yes, "function" will use the device copies of ARRAY and multiplier as long as it is called inside the data region. Hope this helps! :)
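To make the separate host and device copies concrete, here is a small sketch of an unstructured data region. The function name double_on_device is invented for this example. The array is copied to the device, modified there, and only becomes visible on the host again after an explicit update.

```cpp
// Doubles every element of `a`, doing the work on the device copy.
void double_on_device(int *a, int n) {
    // Allocate a device copy of a[0:n] and initialise it from the host.
    #pragma acc enter data copyin(a[0:n])

    // `present` asserts the data is already on the device; nothing is copied here.
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] *= 2;
    }

    // The host copy is stale until it is refreshed explicitly.
    #pragma acc update host(a[0:n])

    // Release the device copy without copying it back again.
    #pragma acc exit data delete(a[0:n])
}
```

Without the update directive (or a copyout on exit data), the host array would keep its old values when the loop is offloaded.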

score 0 · Answer 3 · answered Apr 07 '15 at 05:50

You should give the routine an explicit level of parallelism, such as gang, worker, or vector. That is the more precise way to declare it.

The routine will use the data in device memory.
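A minimal sketch of a routine with an explicit parallelism level follows. The names scale_row and scale_matrix are hypothetical and not taken from the question. The routine is declared vector, so its loop is spread across vector lanes, and it is called from a gang loop so that each gang handles one row.

```cpp
// Vector-level routine: the loop inside is shared among the vector lanes.
#pragma acc routine vector
void scale_row(int *row, int n, int multiplier) {
    #pragma acc loop vector
    for (int i = 0; i < n; ++i) {
        row[i] *= multiplier;
    }
}

// Scales a rows x cols matrix stored contiguously in row-major order.
void scale_matrix(int *matrix, int rows, int cols, int multiplier) {
    // Gangs split the rows; each call to scale_row uses vector parallelism.
    #pragma acc parallel loop gang copy(matrix[0:rows*cols])
    for (int r = 0; r < rows; ++r) {
        scale_row(matrix + r * cols, cols, multiplier);
    }
}
```

Inside scale_row, the pointer row refers to the device copy created by the copy clause on the calling region, which is the point made above.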

Brian Yang