In a Fortran program accelerated with OpenACC, I need to duplicate an array on the GPU. The duplicate will only be used on the GPU and will never be copied back to the host. The only way I know to create it is to declare and allocate it on the host, then create it on the device with an acc data create clause:

    program test
        implicit none
        integer, parameter :: n = 1000
        real :: total
        real, allocatable :: array(:)
        real, allocatable :: array_d(:)

        allocate(array(n))
        allocate(array_d(n))   ! host allocation, never used on the host
        array(:) = 1e0

        ! array_d is created on the device only; no host<->device transfer
        !$acc data copy(array) create(array_d) copyout(total)

        ! duplicate the array on the device
        !$acc kernels
        array_d(:) = array(:)
        !$acc end kernels

        ! use the duplicate on the device
        !$acc kernels
        total = sum(array_d)
        !$acc end kernels

        !$acc end data

        print *, sum(array)
        print *, total

        deallocate(array)
        deallocate(array_d)
    end program

This is only an illustration; the actual program is much more complex.
The problem with this solution is that I have to allocate the duplicate array on the host even though it is never used there. This wastes host memory, especially for large arrays (although I realize I would run out of device memory before running out of host memory). In CUDA Fortran, I know I can declare a device-only array, but I do not know whether this is possible with OpenACC.
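For reference, this is roughly what I mean by a device-only array in CUDA Fortran. It is a minimal sketch, not code from my actual program: it assumes the NVIDIA HPC SDK compiler (nvfortran with the -cuda flag), and the program name device_only_sketch is made up for illustration.

```fortran
! Sketch only: assumes nvfortran -cuda; names are illustrative.
program device_only_sketch
    use cudafor
    implicit none
    integer, parameter :: n = 1000
    integer :: i
    real :: total
    real, allocatable :: array(:)
    ! The device attribute places array_d in GPU memory only;
    ! no host copy of it is ever allocated.
    real, allocatable, device :: array_d(:)

    allocate(array(n))
    allocate(array_d(n))       ! allocated on the device
    array(:) = 1e0

    array_d = array            ! host-to-device copy by assignment

    ! CUF kernel: the loop runs on the GPU, and the scalar
    ! accumulation into total is handled as a reduction.
    total = 0.0
    !$cuf kernel do <<<*, *>>>
    do i = 1, n
        total = total + array_d(i)
    end do

    print *, sum(array)
    print *, total

    deallocate(array)
    deallocate(array_d)
end program device_only_sketch
```

Here array_d exists only in device memory, which is the behavior I would like to reproduce with OpenACC directives.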
Is there a better way to do this?