1

I would like to understand clearly a situations I faced often accelerating an application with OpenACC. Let's say I have this loop:

#pragma acc parallel loop collapse(4)
for (k = KBEG; k <= KEND; k++){
for (j = JBEG; j <= JEND; j++){
for (i = IBEG; i <= IEND; i++){
  for (nv = 0; nv < NVAR; nv++) A0[k][j][i][nv] =
                                data->A[k][j][i][nv];
}}}

Being data a structured type variable:

typedef struct Data_{
  double ****A;    
  double ****B;    
} Data;

I noticed that both with Unified Memory (-ta=tesla:managed) or not, I get an error at the execution: error 700: Illegal address during kernel execution. I identified the problem with the deep copy problem I read in literature: the implicit copy done by the compiler does a simple copy of A, that points to an address on the host memory, but not a copy of the data it is pointing to. The host address cannot be read by the device and this generates the error.

  1. Is the deep copy problem the correct interpretation of my error?

  2. Moreover, if I'm using Unified Memory and it is indeed a deep copy problem, shouldn't the device be capable of reading the address, being A, at least virtually, situated on unified memory and address space?

I can easly resolve the error adding the diretive:

#pragma acc enter data(data)

and adding present(data) to the parallel pragma. Notice that I don't need to copy manually A and B.

I would like to understand the reason of both the problem and the solution.

Steve
  • 89
  • 1
  • 6

1 Answers1

1

Unified memory is only available for allocated (heap) memory. I'm assuming that "data" itself is not allocated? In that case, you do need to include it in a data region and should add the "present" clause so the compiler doesn't try to implicitly copy it.

Mat Colgrove
  • 5,441
  • 1
  • 10
  • 11
  • Do you mean allocated on the host or device memory? 'data.A' and 'data.B' are allocated using malloc. I don't understand when Unified Memory is aviable then, since data is on heap memory I guess. And is this indeed related to deep copy problem? (question n°1) – Steve Nov 17 '20 at 08:48
  • 1
    Allocated on the host but when using CUDA unified memory this is allocated in a managed pool of memory that's accessible on either the host or device. So when "data.A" and "data.B" are allocated via malloc (though the compiler replaces malloc with a call to "cudaMallocManaged"), they use managed pointers. Since "data" is not malloc'd, it still needs to be manually managed via data directives. At some point we hope to be able to use unified memory on stack and static variables, but this requires an update to the Linux OS itself which has not been adopted as of yet. – Mat Colgrove Nov 17 '20 at 15:59
  • I don't understand though, what's the difference between the implicit copy of the compiler in the first case and the copy using `#pragma acc enter data(data)` that works. If the compiler copy "data" on the device and in it we have data.A and data.B that are managed pointers, hence readable by the compiler, why it doesn't work? – Steve Nov 17 '20 at 17:48
  • 1
    I'd want to see the compiler feedback messages, but my best guess is that it's trying to implicitly copy "data->A" as well as "data". It doesn't know that "A" is managed so must still try to copy it. Implicit copies of aggregate type with dynamic data members is difficult for the compiler so it's always best to manually manage them and use a "present" clause so no implicit copy is attempted. – Mat Colgrove Nov 17 '20 at 20:18
  • Ok, I understand, thank you. The compiler message is : **Generating implicit copyin(data[KBEG:KEND-KBEG+1])**. Here the size of data doesn't make sense to me. – Steve Nov 18 '20 at 07:51
  • 1
    Yes, that's odd but probably related to an known issue where the compiler doesn't know how to handle scalar pointers. By default it treats pointers as arrays so the compiler is attempting to use the loop bounds to as the array dimensions. I've talked with our compiler engineers about it, but it's a tough problems since it's difficult for the compiler to tell if a pointer is a scalar or points to an array, so they picked the more common usage of an array when doing implicit copies. – Mat Colgrove Nov 18 '20 at 18:53
  • 1
    As a general rule, I will look over the complier feedback to see where it's doing implicit copies and replace these with explicit copies or use the "present" clause (or more often add "default(present)". While the compiler gets the implicit copy correct in the vast majority of cases, there's enough of these problem cases that I don't like the compiler doing it implicitly. – Mat Colgrove Nov 18 '20 at 18:56