openacc - discrepancies between ta=multicore and ta=nvidia compilation

Question

My code was originally written with OpenMP, and I now want to migrate it to OpenACC. Consider the following:

1- The OpenMP output is treated as the reference result, and the OpenACC output should match it.

2- The code contains two functions, F1 and F2, and a command-line flag selects between them. So either F1 or F2 runs, depending on the input flag.

I have ported the code to OpenACC, and I can compile it with either -ta=multicore or -ta=nvidia to build the OpenACC regions for different architectures.

For F1, both targets produce the same output as OpenMP. That is, whether I compile with -ta=multicore or -ta=nvidia, the results match the OpenMP results when F1 is selected.

F2 behaves differently. Compiling with -ta=multicore gives the same correct output as OpenMP, but compiling with -ta=nvidia gives wrong results.

Any ideas what might be wrong with F2, or with the build process?

Note: I am using PGI compiler 16, and my NVIDIA GPU has compute capability (CC) 5.2.

Are you managing the data movement and synchronizing the host and device data when targeting the GPU? With multicore, data movement isn't necessary, but is with the GPU. Does the code get correct answers if you compile with "-ta=tesla:managed"? Managed enables CUDA Unified Memory thus eliminating the need to manage dynamic data. If managed works, then it's definitely a data movement issue. Posting a reproducing example would be helpful as well. — Mat Colgrove, Dec 22 '16 at 19:46
Thanks @MatColgrove It worked with managed! But, I tried to transfer all the required data to my device. Is there a way to find out whether I point to something garbage on my device or something legitimate (besides using acc_is_present)? — mgNobody, Dec 22 '16 at 20:32
Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structure region that spans across multiple compute regions. In this case, put "update" directives before and after each compute region synchronizing the host and device copies. Next systematically remove each variable. If it fails, keep it in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions. — Mat Colgrove, Dec 22 '16 at 23:11
@MatColgrove : THANKS :) Doing so helps me to find the culprit variable that messes everything up. Your method of updating everything and systematically removing them helped a lot. — mgNobody, Dec 23 '16 at 00:37

score 0 · Accepted Answer · answered Jan 05 '17 at 21:07

The discrepancy between the two architectures was caused by incorrect data transfer between host and device. At some point, the host needed some of the arrays in order to redistribute data, and those arrays were not being transferred correctly.

Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
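
A minimal sketch of this kind of bug, not the actual code from the question: the function names, the array, and the host-side "redistribution" step (a simple in-place reversal) are all hypothetical. A data region spans two compute regions, and the host modifies the array in between. With -ta=multicore the host and the "device" share memory, so the two update directives make no difference; with -ta=nvidia, leaving them out would presumably let the host work on a stale copy while the second kernel keeps using the old device data.

```c
#include <stddef.h>

/* Hypothetical host-only step: reverses the array in place on the CPU. */
static void redistribute_on_host(double *a, int n) {
    for (int i = 0; i < n / 2; ++i) {
        double tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}

/* Hypothetical pipeline: device kernel, then host step, then device kernel.
   Returns the index-weighted sum of the redistributed array. */
double run_pipeline(double *a, int n) {
    double sum = 0.0;

    #pragma acc data copy(a[0:n])
    {
        /* Kernel 1: fill the array on the device. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] = (double)i;
        }

        /* Bring the device values to the host before the host reads them. */
        #pragma acc update self(a[0:n])
        redistribute_on_host(a, n);
        /* Push the host's changes back before the next kernel uses them. */
        #pragma acc update device(a[0:n])

        /* Kernel 2: weight each element by its index and reduce. */
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += a[i] * (double)i;
        }
    }
    return sum;
}
```

For an 8-element array, kernel 1 writes 0..7, the host reverses it to 7..0, and kernel 2 returns 56.0. A compiler without OpenACC support ignores the pragmas and gives the same value, which makes it a convenient reference for comparing builds.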

First, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm itself was correct. This helped me a lot. Then I removed managed to investigate my code and find the array that caused the problem.

Next, I followed this procedure, based on Mat's comment (super helpful):

Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structure region that spans across multiple compute regions. In this case, put "update" directives before and after each compute region synchronizing the host and device copies. Next systematically remove each variable. If it fails, keep it in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions.
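
The procedure in the quoted comment could look roughly like the sketch below. It is only an illustration under assumed names: `add_step`, `run_steps`, and the arrays `a`, `b`, `c` are hypothetical and do not come from the original code. Every array is listed in an update directive before and after the compute region; once the GPU build gives correct results, the arrays can be removed from the update lists one at a time. In this sketch `b` is edited on the host after it was copied to the device, so dropping `b` from the `update device` list should bring the wrong answer back on a GPU target, which would identify it as the culprit.

```c
#include <stddef.h>

/* Hypothetical compute step: c = a + b, with all arrays already on the device. */
void add_step(double *a, double *b, double *c, int n) {
    /* Debug bracket, before: push every host copy to the device.
       Remove variables from this list one at a time while testing. */
    #pragma acc update device(a[0:n], b[0:n], c[0:n])

    #pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }

    /* Debug bracket, after: pull every device copy back to the host. */
    #pragma acc update self(a[0:n], b[0:n], c[0:n])
}

/* Hypothetical driver using an unstructured data region. */
double run_steps(double *a, double *b, double *c, int n) {
    #pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n])

    /* Host-side edit made after the data is already on the device:
       the device copy of b is now out of date. */
    for (int i = 0; i < n; ++i) {
        b[i] = 2.0 * b[i];
    }

    add_step(a, b, c, n);

    #pragma acc exit data delete(a[0:n], b[0:n], c[0:n])

    /* Host-side check of the result. */
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += c[i];
    }
    return total;
}
```

Once the culprit is known, the blanket updates can be trimmed to just the variables and directions that are actually needed, since each update costs a host-device transfer.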
