
I am trying to parallelize a simple loop in an MPI code with OpenACC, but the output is not what I expect. The loop contains a call to a subroutine, and I have added an 'acc routine seq' directive to that subroutine. If I manually inline the call and delete the subroutine, the result is correct (a sketch of the inlined loop is shown after the module code below). Am I using the OpenACC "routine" directive correctly, or is something else wrong?

  • Runtime environment

MPI version: Open MPI 4.0.5
Compiler: NVIDIA HPC SDK 20.11
CUDA version: 10.2

!The main program
program test
  use simple
  use mpi
  implicit none
  integer :: i,id,size,ierr,k,n1
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,id,ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

  allocate(type1(m))
  do i=1,m
    allocate(type1(i)%member(n))
    type1(i)%member=-1
    type1(i)%member(i)=i
  enddo
  
  !$acc update device(m,n)
  do k=1,m
    n1=0
    allocate(dev_mol(k:2*k))
    dev_mol=type1(k)%member(k:2*k)
    !$acc update device(dev_mol(k:2*k))
    !$acc parallel copy(n1) firstprivate(k)
    !$acc loop independent
    do i=k,2*k
      call test1(k,n1,i)
    enddo
    !$acc end parallel
    !$acc update self(dev_mol(k:2*k))
    type1(k)%member(k:2*k)=dev_mol
    write(*,"('k=',I3,' n1=',I2)") k,n1
    deallocate(dev_mol)
  enddo
  
  do i=1,m
    write(*,"('i=',I2,' member=',I3)") i,type1(i)%member(i)
    deallocate(type1(i)%member)
  enddo
  deallocate(type1)
  call MPI_Barrier(MPI_COMM_WORLD,ierr)
  call MPI_Finalize(ierr)
end


!Here is the module
module simple
  implicit none
  integer :: m=5,n=2**15
  integer,parameter :: p1=15
  integer,allocatable :: dev_mol(:)
  type type_related
    integer,allocatable :: member(:)
  end type
  type(type_related),allocatable :: type1(:)
  
  !$acc declare create(m,n,dev_mol)
  !$acc declare copyin(p1)
  contains
    subroutine test1(k,n1,i)
      implicit none
      integer :: k,n1,i
      !$acc routine seq
      if(dev_mol(i)>0) then
        !write(*,*) 'gpu',k,n1,i
        n1=dev_mol(i)
        dev_mol(i)=p1
      else
        if(i==k)write(*,*) 'err',i,dev_mol(i)
      endif
    end
end
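
For reference, this is roughly what the working inlined variant looks like: the body of test1 is pasted directly into the parallel loop in place of the call (same statements as in the subroutine above), so no 'acc routine' is needed.

    !$acc parallel copy(n1) firstprivate(k)
    !$acc loop independent
    do i=k,2*k
      ! body of test1 inlined in place of the call
      if(dev_mol(i)>0) then
        n1=dev_mol(i)
        dev_mol(i)=p1
      else
        if(i==k)write(*,*) 'err',i,dev_mol(i)
      endif
    enddo
    !$acc end parallel
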
  • MPI

compile command: mpif90 test.f90 -o test
run command: mpirun -n 1 ./test
The result is as follows:

k=  1 n1= 1
k=  2 n1= 2
k=  3 n1= 3
k=  4 n1= 4
k=  5 n1= 5
i= 1 member= 15
i= 2 member= 15
i= 3 member= 15
i= 4 member= 15
i= 5 member= 15
  • MPI+OpenACC

compile command: mpif90 test.f90 -o test -ta=tesla:cuda10.2 -Minfo=accel
run command: mpirun -n 1 ./test
The incorrect result is as follows:

k=  1 n1= 0
k=  2 n1= 0
k=  3 n1= 0
k=  4 n1= 0
k=  5 n1= 0
i= 1 member= 1
i= 2 member= 2
i= 3 member= 3
i= 4 member= 4
i= 5 member= 5
– Xin Ding
  • Do you even need MPI to demonstrate the issue? – Gilles Gouaillardet Apr 06 '21 at 04:00
  • Welcome, please take the [tour]. Please use the [tag:fortran] tag for all Fortran questions; Fortran 90 is just one very old version of the standard. I also suggest trying to reproduce the issue in an even simpler program without MPI. Which compiler do you use? – Vladimir F Героям слава Apr 06 '21 at 05:58
  • @Vladimir F Thanks for your answer. The compiler I use is NVFORTRAN 20.11. I have just removed all the MPI code and reproduced the issue without MPI. Do I need to upload the new code? – Xin Ding Apr 06 '21 at 09:41
  • Yes please, update the code. We want to look at exactly what you are looking at. – Ian Bush Apr 06 '21 at 10:05

1 Answer


The problem is that "i" is being passed by reference (the default in Fortran). The simplest solution is to pass it by value:

  contains
    subroutine test1(k,n1,i)
      implicit none
      integer, value :: i
      integer :: n1, k

There is also a small compiler bug here: since "i" is the loop index variable, the compiler should be implicitly privatizing it; however, because it is passed by reference, it ends up being made shared. We'll get this fixed in a future compiler version. In any case, passing scalars by value when possible is generally advisable.
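
For completeness, here is a sketch of the full subroutine with only the declaration of "i" changed; the body is identical to the original:

    subroutine test1(k,n1,i)
      implicit none
      integer, value :: i   ! loop index passed by value
      integer :: n1, k
      !$acc routine seq
      if(dev_mol(i)>0) then
        n1=dev_mol(i)
        dev_mol(i)=p1
      else
        if(i==k)write(*,*) 'err',i,dev_mol(i)
      endif
    end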

Example run with the update:

% mpif90 test2.f90 -acc -Minfo=accel -V21.2 ; mpirun -np 1 a.out
test1:
     16, Generating acc routine seq
         Generating Tesla code
test:
     48, Generating update device(m,n)
     53, Generating update device(dev_mol(k:k*2))
     54, Generating copy(n1) [if not already present]
         Generating Tesla code
         56, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     60, Generating update self(dev_mol(k:k*2))
k=  1 n1= 1
k=  2 n1= 2
k=  3 n1= 3
k=  4 n1= 4
k=  5 n1= 5
i= 1 member= 15
i= 2 member= 15
i= 3 member= 15
i= 4 member= 15
i= 5 member= 15
– Mat Colgrove
  • Excellent. Thanks a lot. – Xin Ding Apr 07 '21 at 00:28
  • FYI, the NVHPC 21.3 SDK fixed the issue with implicitly privatizing a loop index variable that is passed by reference, so the original version will work as expected. However, I still recommend passing scalars by value when possible. – Mat Colgrove Apr 09 '21 at 16:20