I am trying to understand why my OpenACC code runs roughly 15,000 times faster on an NVIDIA V100 GPU than on an AMD MI250 GPU (329.87 s vs 0.022 s, a factor of about 14,965). It is a simple matrix-matrix multiplication code. Here is the output I obtained on the NVIDIA V100 GPU, where the kernel took 2.2043999284505844E-002 sec:
[ilkhom@topaz-3 MCCC-FN-GPU_DEV]$ cat acc.f90
!nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90
program main
  implicit none
  integer :: nkgmax, nchmax, i, f, j, nr, k
  real(kind=8), allocatable, dimension(:) :: cont_wave
  real(kind=8), allocatable, dimension(:,:) :: vmat2D
  real(kind=8) :: tmp
  integer :: time1, time2, dt, count_rate, count_max
  real(kind=8) :: secs_acc

  nkgmax = 2000
  nr = 2000
  allocate(cont_wave(1:nkgmax*nr))
  cont_wave(:) = 0.d0
  do i = 1, nkgmax
    do j = 1, nr
      cont_wave((i-1)*nr+j) = dble(i-j)/dble(i+j)
    enddo
  enddo

  !!!! OpenACC test:
  !$acc enter data copyin(cont_wave,nr,nkgmax,nchmax)
  allocate(vmat2D(1:nkgmax,1:nkgmax))
  call system_clock(count_max=count_max, count_rate=count_rate)
  call system_clock(time1)
  !$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
  !$acc loop independent vector(16)
  do i = 1, nkgmax
    !$acc loop independent vector(16)
    do j = 1, nkgmax
      ! vmat2D is symmetric: compute only the lower triangle and mirror it
      if (j.gt.i) cycle
      tmp = 0.d0
      !$acc loop seq
      do k = 1, nr
        tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
      enddo
      vmat2D(i,j) = tmp
      if (i/=j) vmat2D(j,i) = vmat2D(i,j)
    enddo
  enddo
  !$acc end kernels
  call system_clock(time2)
  dt = time2 - time1
  secs_acc = real(dt)/real(count_rate)
  print*, 'time in secs in OpenACC', secs_acc
  print*, 'min=', minval(vmat2D(1:nkgmax,1:nkgmax))
  print*, 'max=', maxval(vmat2D(1:nkgmax,1:nkgmax))
  print*, 'mean=', sum(vmat2D(1:nkgmax,1:nkgmax))/dble(nkgmax*nkgmax)
end program main
[ilkhom@t006 MCCC-FN-GPU_DEV]$ nvfortran -fast -Minfo=accel -acc -gpu=lineinfo,ptxinfo acc.f90 ; ./a.out
main:
25, Generating enter data copyin(cont_wave(:),nchmax,nr,nkgmax)
30, Generating copyout(vmat2d(:,:)) [if not already present]
Generating present(nkgmax,cont_wave(:))
32, Loop is parallelizable
34, Loop is parallelizable
Generating Tesla code
32, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
34, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
38, !$acc loop seq
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'main_34_gpu' for 'sm_70'
ptxas info : Function properties for main_34_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 176 registers, 392 bytes cmem[0]
time in secs in OpenACC 2.2043999284505844E-002
min= -760.4901596366437
max= 1973.862266351370
mean= 221.6705356107172
And here is the output on the AMD MI250 GPU, where the same code took 329.87 sec:
abdurakhmanov@uan01:/scratch/project_462000053/ilkhom/openacc/TEST> ftn -h acc -O3 acc_cray.f90 -o check_acc; srun ./check_acc
time in secs in OpenACC 329.869873046875
min= -760.49015963664374
max= 1973.8622663513693
mean= 221.67053561071717
One note: on the AMD GPU I am using the Cray compiler (ftn), and for each vector(16) directive it was emitting the warning
!$acc loop independent vector(16)
ftn-7271 ftn: WARNING MAIN, File = acc_cray.f90, Line = 36
Unsupported OpenACC vector_length expression: Converting 16 to 1.
so in acc_cray.f90 I changed each !$acc loop independent vector(16) to !$acc loop independent vector(32); the modified kernel region is shown below.
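This is how the kernel region looks in acc_cray.f90 after that change (as far as I can tell, the vector length is the only difference from acc.f90):

!$acc kernels copyout(vmat2D) present(cont_wave,nkgmax)
!$acc loop independent vector(32)     ! was vector(16) with nvfortran
do i = 1, nkgmax
  !$acc loop independent vector(32)   ! was vector(16) with nvfortran
  do j = 1, nkgmax
    if (j.gt.i) cycle
    tmp = 0.d0
    !$acc loop seq
    do k = 1, nr
      tmp = tmp + cont_wave((k-1)*nkgmax+i)*cont_wave((k-1)*nkgmax+j)
    enddo
    vmat2D(i,j) = tmp
    if (i/=j) vmat2D(j,i) = vmat2D(i,j)
  enddo
enddo
!$acc end kernels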
I also have more detailed logs for the MI250 GPU, generated by setting export CRAY_ACC_DEBUG=3, which I can attach if required.
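In case it helps, this is how I capture those logs; on my system the Cray OpenACC runtime writes the debug output to stderr (the log filename here is just an example):

export CRAY_ACC_DEBUG=3
srun ./check_acc 2> cray_acc_debug.log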
I expected to see at least comparable runtimes on the NVIDIA V100 and AMD MI250 GPUs.
Cheers, Ilkhom