How does the OpenACC copyin directive work?

Question

According to the OpenACC documentation:

copyin - Create space for the listed variables on the device, initialize the variable by copying data to the device at the beginning of the region, and release the space on the device when done without copying the data back to the host.

I've created a test program:

#include <stdio.h>

int main(int argc, char** argv)
{
    int teste[] = { -15 };


    #pragma acc data copyin(teste[0:1])
    {

        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d", teste[0]);
    return 0;
}

According to the docs, the program should output -15, since the data is modified on the device and the result is not copied back to the host. But when I compile and run this code, the output is 4999.

My compiler is gcc (tdm64-1) 10.3.0, and I'm running the program on a computer with separate device and host memory.

I'd like to know why this is not working, and what I could do to prevent the copy from the device back to the host.

Here's the program running in Git Bash on Windows:

$ cat test.c && echo "" &&gcc -fopenacc test.c && ./a.exe

#include <stdio.h>

int main(int argc, char** argv)
{
    int teste[] = { -15 };


    #pragma acc data copyin(teste[0:1])
    {

        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}
4999

I also got access to a Linux machine, and even with nvc I could not get the expected result:

cat test.c && echo "" && /opt/nvidia/hpc_sdk/Linux_x86_64/2021/compilers/bin/nvc -acc -Minfo=accel test.c && ./a.out
#include <stdio.h>

int main(int argc, char** argv)
{
    int teste[] = { -15 };


    #pragma acc data copyin(teste[0:1])
    {

        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}

main:
    9, Generating copyin(teste[:]) [if not already present]
       Generating NVIDIA GPU code
       12, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
4999

By the way: at this time, GCC doesn't support GPU code offloading on Windows. It's certainly possible to implement, but somebody needs to do it, or pay for the work. – tschwinge, Jan 26 '22 at 07:53

score 1 · Answer 1 · answered Dec 17 '21 at 20:01

The program should print -15 since the value isn't changed on the host. Hence this is either a bug in gcc or you're not actually enabling OpenACC. What compiler flags are you using?

Here's the output using nvc targeting an NVIDIA A100:

% cat test.c
#include <stdio.h>

int main(int argc, char** argv)
{
    int teste[] = { -15 };


    #pragma acc data copyin(teste[0:1])
    {

        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}
% nvc test.c -acc -Minfo=accel ; a.out
main:
     10, Generating copyin(teste[:]) [if not already present]
         Generating NVIDIA GPU code
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
-15

Mat Colgrove

Running on Windows, the same commands you used through git bash (but with gcc instead of nvc) gave me the same result. I edited the question to show the output – Yrk06 Dec 17 '21 at 21:15
Odd, it's like it's not actually running on the GPU (I assume you do have an NVIDIA GPU installed?). Can you set the environment variable "NV_ACC_NOTIFY=3" with the nvc build? This will have the NV OpenACC runtime print the kernel launches and data movement from the run, so we can see if it's actually running on the GPU. If it doesn't show anything, it's running on the CPU. In that case, post the output from running 'nvaccelinfo' or 'nvidia-smi' so I can see what device and CUDA driver version you're using. – Mat Colgrove Dec 17 '21 at 22:28
Note that when running on a target that shares memory with the host (i.e. targeting multicore CPU or using CUDA Unified Memory), the data regions are essentially ignored, so the 4999 answer would be expected. I know this is just an example to understand the constructs, but in general, don't rely on having discrete memories. – Mat Colgrove Dec 17 '21 at 22:31
On the Linux machine I tried using NV_ACC_NOTIFY=3 and nothing was shown. I also tried running `nvaccelinfo` and `nvidia-smi`, but neither command was found. The system admin told me this machine has access to a GPU, but I'll have to check with the admins whether it actually has an NVIDIA GPU and whether the virtual machine running Linux has access to it. – Yrk06 Dec 17 '21 at 23:39
In the meantime, the Windows computer has a GTX 1080, and using `GOMP_DEBUG=1` (which should do the same as NV_ACC_NOTIFY=3) returns:
```
$ GOMP_DEBUG=1 ; ./a.exe
4999
GOACC_data_start: mapnum=1, hostaddrs=0000003baefff898, size=00007ff7cc682010, kinds=00007ff7cc682018
GOACC_parallel_keyed: mapnum=1, hostaddrs=0000003baefff890, size=00007ff7cc682020, kinds=00007ff7cc682028
GOACC_data_end: restore mappings
GOACC_data_end: mappings restored
```
– Yrk06 Dec 17 '21 at 23:40
For NVHPC, since there's no GPU (or CUDA) installed, it's falling back to the host. For GNU, I'm not sure. Let me ping Thomas, since he works on OpenACC support in the GNU compilers (though he may be out for the holidays). I work for NVIDIA on the NVHPC compilers, so I don't have much insight into GNU. – Mat Colgrove Dec 20 '21 at 20:55
It's the same as in the Nvidia setting: the runtime doesn't detect the GPU, and thus uses host-fallback execution, which gives you the result you're seeing. – tschwinge Jan 26 '22 at 07:44
