
To make things easy, here's the sample code, ready to be compiled and run:

module elemWiseOps
   USE cudafor
   USE cublas
   !
   ! Definition of kind parameters for single (SP) and double (DP) precision reals
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   !
   INTERFACE VMUL
       MODULE PROCEDURE dVmul
   END INTERFACE

   INTERFACE VADD
       MODULE PROCEDURE dVadd
   END INTERFACE

   Contains

       attributes(global) subroutine dVmul(a, b)
       !
       IMPLICIT NONE
       !
       REAL(DP), DIMENSION(:,:), DEVICE, INTENT(in)    :: a
       REAL(DP), DIMENSION(:,:), DEVICE, INTENT(inout) :: b
       !
       ! local variables
       !
       INTEGER, DEVICE                  :: i1, j, n(2)

       i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
       j = (blockIdx%y-1)*blockDim%y + threadIdx%y
       n(1) = size(a,1)
       n(2) = size(a,2)
       if (i1<=n(1) .and. j<=n(2)) b(i1,j) = a(i1,j) * b(i1,j)
     end subroutine dVmul

     attributes(global) subroutine dVadd(a, b)
       !
       IMPLICIT NONE
       !
       REAL(DP), DIMENSION(:,:), DEVICE, INTENT(in)    :: a
       REAL(DP), DIMENSION(:,:), DEVICE, INTENT(inout) :: b
       !
       ! local variables
       !
       INTEGER, DEVICE                  :: i1, j, n(2)

       i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
       j = (blockIdx%y-1)*blockDim%y + threadIdx%y
       n(1) = size(a,1)
       n(2) = size(a,2)
       if (i1<=n(1) .and. j<=n(2)) b(i1,j) = a(i1,j) + b(i1,j)
     end subroutine dVadd

     attributes(global) subroutine dPrintV(a)
       !
       IMPLICIT NONE
       !
       REAL(DP), DIMENSION(:,:), DEVICE, INTENT(in)  :: a
       !
       ! local variables
       !
       INTEGER, DEVICE    :: i1, j1, n(2)

       n(1) = size(a,1)
       n(2) = size(a,2)

       DO j1 = 1, n(2)
           Do i1 = 1, n(1)
              print*, a(i1,j1)
           ENDDO
       ENDDO
     end subroutine dPrintV

   end module elemWiseOps

   PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE cublas
   USE elemWiseOps

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DIMENSION(:,:)           :: a,   b
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:)   :: a_d, b_d
   !
   INTEGER, PARAMETER                              :: m = 500, n = 1000
   INTEGER    :: i1, i2, istat
   type(dim3) :: grid, tBlock
   !
   tBlock    = dim3(32,32,1)
   grid      = dim3(ceiling(real(m)/tBlock%x), &
                    ceiling(real(n)/tBlock%y), 1)
   !
   ! Allocate storage for the arrays
   !
   Allocate(a(m,n),b(m,n))
   Allocate(a_d(m,n),b_d(m,n))
   !
   ! Initialize the host arrays
   !
   Do i2 = 1, n
       Do i1 = 1, m
           a(i1, i2) = REAL(i1, DP)
           b(i1, i2) = REAL(i1, DP)
       Enddo
   Enddo
   !
   ! Copy to the device arrays
   !
   istat = cudaMemcpy2D(a_d, m, a, m, m, n)
   istat = cudaMemcpy2D(b_d, m, b, m, m, n)
   !!!
   !
   ! Now invoke the kernels
   !
   Call vadd<<<grid,tBlock>>>(a_d, b_d)
   Call vadd<<<grid,tBlock>>>(a_d, b_d)
   Call vadd<<<grid,tBlock>>>(a_d, b_d)
   !
   ! Free storage for the arrays
   !
   Deallocate(a,b)
   Deallocate(a_d,b_d)
   !
END PROGRAM Test

I compiled the code with

nvfortran -O3 -cuda -fast -gpu=cc60 -lcufft -lcublas -Minfo=accel test.f90

Then I profiled it with

nvprof --print-api-trace ./a.out

Here's what happened when I call the kernel 3 times in a row:

... ...
171.32ms  287.59ms  cudaMalloc
458.94ms  98.290us  cudaMalloc
462.69ms  816.76us  cudaMemcpy2D
463.51ms  869.13us  cudaMemcpy2D
464.39ms  134.45us  cudaMalloc
464.52ms  12.703us  cudaMemcpy
464.54ms  5.2460us  cudaMalloc
464.54ms  4.1840us  cudaMemcpy
464.55ms  52.608us  cudaLaunchKernel (elemwiseops_dvadd_ [119])
464.61ms  123.66us  cudaFree
464.73ms  78.753us  cudaFree
464.81ms  107.87us  cudaMalloc
464.92ms  6.6480us  cudaMemcpy
464.92ms  3.7720us  cudaMalloc
464.93ms  3.8070us  cudaMemcpy
464.93ms  10.245us  cudaLaunchKernel (elemwiseops_dvadd_ [126])
464.94ms  124.35us  cudaFree
465.07ms  56.498us  cudaFree
465.12ms  93.173us  cudaMalloc
465.22ms  6.2260us  cudaMemcpy
465.22ms  3.6010us  cudaMalloc
465.23ms  3.6800us  cudaMemcpy
465.23ms  7.6920us  cudaLaunchKernel (elemwiseops_dvadd_ [133])
465.24ms  125.72us  cudaFree
465.37ms  47.850us  cudaFree
465.64ms  60.294us  cudaFree
465.70ms  208.73us  cudaFree

                          

My question is: is it possible to avoid the overhead of these cudaMalloc, cudaMemcpy and cudaFree calls when invoking a kernel in CUDA Fortran? Many thanks!

Ya Squall

1 Answer


Reason found. Using explicit-shape arrays as dummy arguments instead of assumed-shape ones eliminates the creation of array temporaries. That is, revising the original kernel to the following form solves the problem:

attributes(global) subroutine dVadd(m, n, a, b)
!
IMPLICIT NONE
!
integer, value :: m, n
REAL(DP), DIMENSION(1:m,1:n), DEVICE, INTENT(in)    :: a
REAL(DP), DIMENSION(1:m,1:n), DEVICE, INTENT(inout) :: b
!
! local variables
!
INTEGER, DEVICE                  :: i1, j

i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
j = (blockIdx%y-1)*blockDim%y + threadIdx%y

if (i1<=m .and. j<=n) b(i1,j) = a(i1,j) + b(i1,j)

end subroutine dVadd
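
For completeness (the launch sites are not shown above), the calls in the main program then have to pass the dimensions by value to match the new explicit-shape dummies. A minimal sketch, assuming the rest of the module and program stay unchanged:

! In PROGRAM Test: pass m and n by value so the kernel knows the array shape
! at compile time and no extra per-launch copies are scheduled
Call vadd<<<grid,tBlock>>>(m, n, a_d, b_d)
Call vadd<<<grid,tBlock>>>(m, n, a_d, b_d)
Call vadd<<<grid,tBlock>>>(m, n, a_d, b_d)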

And here's the updated nvprof output:

... ...
171.32ms  287.59ms  cudaMalloc
458.94ms  98.290us  cudaMalloc
462.69ms  816.76us  cudaMemcpy2D
463.51ms  869.13us  cudaMemcpy2D
464.55ms  52.608us  cudaLaunchKernel (elemwiseops_dvadd_ [119])
465.12ms  52.608us  cudaLaunchKernel (elemwiseops_dvadd_ [126])
465.61ms  52.608us  cudaLaunchKernel (elemwiseops_dvadd_ [133])
466.93ms  123.66us  cudaFree
467.75ms  78.753us  cudaFree

This conclusion somewhat conflicts with my experience in ordinary Fortran code, where no temporary arrays should be created when passing whole allocatable arrays as actual arguments to a subroutine that declares assumed-shape dummy arguments.
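
For comparison, here is a minimal host-only Fortran sketch of that ordinary situation (illustrative only, not part of the original post): passing a whole allocatable array to an assumed-shape dummy passes just the array descriptor, and no temporary is created (for instance, gfortran's -Warray-temporaries stays silent on it).

module host_demo
   implicit none
contains
   subroutine vadd_host(a, b)
      ! Assumed-shape dummies: the caller's descriptor is passed, no copy
      real(8), dimension(:,:), intent(in)    :: a
      real(8), dimension(:,:), intent(inout) :: b
      b = a + b
   end subroutine vadd_host
end module host_demo

program host_test
   use host_demo
   implicit none
   real(8), allocatable :: a(:,:), b(:,:)
   allocate(a(500,1000), b(500,1000))
   a = 1.0d0
   b = 2.0d0
   ! Whole allocatable arrays as actual arguments: no temporary array is made
   call vadd_host(a, b)
end program host_test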

Ya Squall
  • I was assuming something like that but had no way to test it, but it is quite unfortunate. You might also try whether the `contiguous` attribute fixes it instead, even with assumed shape (see the sketch after these comments). The problem is that in regular Fortran assumed-shape arrays may be non-contiguous, and hence the iterations are non-trivial; one must be prepared for additional strides. That must be avoided in CUDA. I use the `contiguous` attribute in my code a lot. Beware: if the actual argument is not actually contiguous, a copy will be made. – Vladimir F Героям слава Dec 15 '20 at 09:56
  • The additional copy operations are of a relatively small size (176 bytes according to my test) so I don't think they are "temporary arrays" according to [this usage](https://stackoverflow.com/questions/28859524/fortran-runtime-warning-temporary-array). I also don't think a `contiguous` guarantee would change anything, but I haven't tried it. In order to perform 2D array indexing, the device code compiler needs to know the width of the array (for the 2D case). That information has to come from somewhere. The `cudaMemcpy2D` operation is copying data (only) and no other information. – Robert Crovella Dec 15 '20 at 21:42
  • Therefore I suspect the additional copies (one per array) are carrying the additional array shape information. When you make the array shape explicit, the compiler can generate the necessary indexing with that information at compile time, and does not need to schedule the extra runtime copy operations. I wouldn't expect to see this for a 1D array, whether shape is explicit or not. – Robert Crovella Dec 15 '20 at 21:44
  • Just tested on 1D arrays: same behavior as 2D ones, i.e., assumed shape causes an additional 176-byte copy and explicit shape does not. – Ya Squall Dec 16 '20 at 23:40
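
For reference, the `contiguous` variant suggested in the first comment would look roughly like the sketch below. It is a hypothetical alternative, not tested in this thread; the second comment doubts it removes the extra copies, since the kernel would still receive a descriptor carrying the shape at run time.

attributes(global) subroutine dVadd(a, b)
   IMPLICIT NONE
   ! Assumed-shape dummies with the CONTIGUOUS attribute: the compiler may
   ! assume unit-stride storage, but the array shape is still only known at run time
   REAL(DP), DIMENSION(:,:), CONTIGUOUS, DEVICE, INTENT(in)    :: a
   REAL(DP), DIMENSION(:,:), CONTIGUOUS, DEVICE, INTENT(inout) :: b
   INTEGER :: i1, j

   i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
   j  = (blockIdx%y-1)*blockDim%y + threadIdx%y
   if (i1 <= size(a,1) .and. j <= size(a,2)) b(i1,j) = a(i1,j) + b(i1,j)
end subroutine dVadd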