In a Fortran program accelerated with OpenACC, I need to duplicate an array on the GPU. The duplicate will only be used on the GPU and will never be copied back to the host. The only way I know to create it is to declare and allocate it on the host, then put it in an acc data create clause:

program test
    implicit none
    integer, parameter :: n = 1000
    real :: total
    real, allocatable :: array(:)
    real, allocatable :: array_d(:)

    allocate(array(n))
    allocate(array_d(n))

    array(:) = 1e0

    !$acc data copy(array) create(array_d) copyout(total)

    !$acc kernels
    array_d(:) = array(:)
    !$acc end kernels

    !$acc kernels
    total = sum(array_d)
    !$acc end kernels

    !$acc end data

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program

This is only an illustrative example; the actual program is much more complex.

The problem with this solution is that I have to allocate the duplicated array on the host, even though I never use it there. This wastes host memory, especially for large arrays (although I know I would run out of device memory before running out of host memory). In CUDA Fortran, I know I can declare a device-only array, but I do not know whether this is possible with OpenACC.

Is there a better way to do this?

Neraste

1 Answer

The OpenACC spec has the "acc declare device_resident" directive, which allocates a device-only array; you'd use it instead of a "data create". Something like:

program test
    implicit none
    integer, parameter :: n = 1000
    real :: total
    real, allocatable :: array(:)
    real, allocatable :: array_d(:)
    !$acc declare device_resident(array_d)
    allocate(array(n))
    allocate(array_d(n))

    array(:) = 1e0

    !$acc data copy(array) copyout(total)

    !$acc kernels
    array_d(:) = array(:)
    !$acc end kernels

    !$acc kernels
    total = sum(array_d)
    !$acc end kernels

    !$acc end data

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program

Though due to the complexity of the implementation and the lack of a compelling use case, our compiler (NVHPC aka PGI) treats device_resident as a create, i.e., the host array is still allocated. So if you're using NVHPC and truly need a device-only array, you'll want to use the CUDA Fortran "device" attribute on the array. CUDA Fortran and OpenACC are interoperable, so it's fine to mix them.
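
As a rough sketch of that mixed approach, the example from the question could look like the code below. It assumes NVHPC's nvfortran with both OpenACC and CUDA Fortran enabled (for example `nvfortran -acc -cuda test.f90`), and it assumes the compiler recognizes the device-attributed array inside the OpenACC compute regions, so the array is not listed in any data clause. The program name `test_device` is made up for illustration.

```fortran
program test_device
    implicit none
    integer, parameter :: n = 1000
    real :: total
    real, allocatable :: array(:)
    ! CUDA Fortran extension (assumes nvfortran with -cuda):
    ! array_d lives in device memory only, with no host copy.
    real, allocatable, device :: array_d(:)

    allocate(array(n))
    allocate(array_d(n))   ! allocated on the device

    array(:) = 1e0

    ! array_d is deliberately absent from the data clauses:
    ! it is already device data.
    !$acc data copy(array) copyout(total)

    !$acc kernels
    array_d(:) = array(:)
    !$acc end kernels

    !$acc kernels
    total = sum(array_d)
    !$acc end kernels

    !$acc end data

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program test_device
```

The trade-off is portability: the "device" attribute is a CUDA Fortran extension, so this version will only build with a compiler that supports it.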

However, wasting a bit of host memory isn't an issue for the vast majority of codes, and since no data is copied, there's no performance impact. Hence if you kept the code as is, it shouldn't be a problem.

Mat Colgrove
  • How might one go about trying to supply a "compelling use case"? One of my collaborators is not using OpenACC because the host array allocation leads to excessive memory use on the host – Ian Bush Sep 18 '21 at 08:09
  • There are a number of factors that our management uses to prioritize features, such as other current projects, difficulty of implementation, whether the requester has purchased support, general usefulness of the feature, and whether the requester is blocked with no viable workaround available, plus other factors. So my first question to your collaborator would be whether using CUDA Fortran "device" attributes would be a sufficient workaround. If not, then I'd need to get more details about the project and present it to our management, which I'm happy to do. – Mat Colgrove Sep 18 '21 at 15:41
  • I'm not sure if SO has direct messaging, but if so, please feel free to contact me. If not, and you don't mind registering on NVIDIA's developer forums, you can contact me there. I moderate the NV HPC SDK forum at: https://forums.developer.nvidia.com/c/accelerated-computing/hpc-compilers/299/l/latest – Mat Colgrove Sep 18 '21 at 15:43
  • I'll chat to him. I doubt CUDA Fortran is an acceptable alternative for portability reasons, but that's his call, not mine. The code is a well-known HPC electronic structure application. – Ian Bush Sep 18 '21 at 16:06