
I've watched the recorded OpenACC overview course videos up to lecture 3, which covers expressing data movement. How would you move a gsl_matrix* from the CPU to the GPU using copyin()? For example, on the CPU I can do something like,

gsl_matrix *Z = gsl_matrix_calloc(100, 100);

which will give me a 100x100 matrix of zeroes. Now Z is a pointer to a gsl_matrix structure which looks like,

typedef struct {
  size_t size1;
  size_t size2;
  size_t tda;
  double * data;
  gsl_block * block;
  int owner;
} gsl_matrix;

How would I express data movement of Z (which is a pointer) from the CPU to the GPU using copyin()?

navmendoza
1 Answer


I can't speak directly to using GSL within OpenACC data and compute regions, but I can give you a general answer about aggregate types with dynamic data members.

The first thing to try, assuming you're using the PGI compilers and a newer NVIDIA device, is CUDA Unified Memory (UVM). If you compile with the flag "-ta=tesla:managed", all dynamically allocated data will be managed by the CUDA runtime, so you don't need to manage the data movement yourself. There is some overhead involved and there are caveats, but it makes it easier to get started. Note that CUDA 8.0, which ships with PGI 16.9 and later, improves UVM performance.
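For example, assuming the PGI C compiler pgcc and a source file named matrix.c (the file name is just for illustration), the compile line would look something like:

    pgcc -ta=tesla:managed -Minfo=accel matrix.c -o matrix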

Without UVM, you need to perform a manual deep copy of the data. Below is the basic idea: first create the parent structure on the device, which is a shallow copy. Next, create the dynamic array "data" on the device, copy over its initial values, and attach the device pointer for "data" to the device structure's data pointer. Since "block" is itself an array of structs with dynamic data members, you'll need to loop through that array, creating its data arrays on the device.

    matrix *mat = (matrix*) malloc(sizeof(matrix));
    // Shallow copy of the parent structure to the device
    #pragma acc enter data copyin(mat[0:1])
    // Change these to the correct sizes of "data" and "block"
    #pragma acc enter data copyin(mat->data[0:dataSize])
    #pragma acc enter data copyin(mat->block[0:blockSize])
    for (i = 0; i < blockSize; ++i) {
       #pragma acc enter data copyin(mat->block[i].data[0:mat->block[i].size])
    }
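Applied to your gsl_matrix example, note that in GSL "block" actually points to a single gsl_block (a size plus a data pointer) rather than an array, so the deep copy simplifies. A minimal sketch, assuming the struct layout shown in the question:

    gsl_matrix *Z = gsl_matrix_calloc(100, 100);
    // Shallow copy of the gsl_matrix struct itself
    #pragma acc enter data copyin(Z[0:1])
    // Element storage; for a freshly allocated matrix the valid extent
    // is size1*tda doubles
    #pragma acc enter data copyin(Z->data[0:Z->size1*Z->tda])
    // The single gsl_block header and its data array. Z->data and
    // Z->block->data alias the same host memory, so this copyin should
    // find the data already present and just attach the device pointer.
    #pragma acc enter data copyin(Z->block[0:1])
    #pragma acc enter data copyin(Z->block->data[0:Z->block->size])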

To delete, walk the structure again, deleting from the bottom up:

    for (i = 0; i < blockSize; ++i) {
       #pragma acc exit data delete(mat->block[i].data)
    }
    #pragma acc exit data delete(mat->block)
    #pragma acc exit data delete(mat->data)
    #pragma acc exit data delete(mat)

When you update, be sure to only update scalars or arrays of fundamental data types, i.e., update "data" but not "block". Update performs a shallow copy, so updating "block" would overwrite the host or device pointers, leading to illegal addresses.
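For example, a minimal sketch of syncing just the numeric data, reusing the hypothetical "dataSize" from above:

    // Push the host values of "data" to the device copy
    #pragma acc update device(mat->data[0:dataSize])
    // ... run compute regions ...
    // Pull the results back to the host
    #pragma acc update self(mat->data[0:dataSize])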

Finally, be sure to put the matrix variable in a "present" clause when using it in a compute region, for example:

    #pragma acc parallel loop present(mat)
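    // A sketch of a loop the pragma might apply to; "dataSize" is again
    // the assumed element count, not part of the original answer
    for (i = 0; i < dataSize; ++i) {
       mat->data[i] *= 2.0;
    }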
Mat Colgrove
  • Mat, thanks so much! I have an NVIDIA GTX 560 Ti. Will CUDA UVM work on this? – navmendoza Oct 03 '16 at 22:51
  • I believe Fermi (cc2.x) is able to use UVM. I haven't used it on a GTX 560 in particular so can't be sure, but I don't see why not. – Mat Colgrove Oct 04 '16 at 15:00
  • Hi Mat, thanks again. When I use the flag "-ta=tesla:managed", my code compiles. However, when I try to execute I get the error "Accelerator Fatal Error: No CUDA device code available". Does this mean I can't use UVM on an NVIDIA GTX 560 Ti? – navmendoza Oct 12 '16 at 12:06
  • It looks like I was wrong: UVM is only supported on Kepler (cc30) architectures and above. You'll need to perform the manual deep copy in OpenACC. – Mat Colgrove Oct 12 '16 at 16:50