
I am trying to learn how to perform inter-GPU data communication using the following toy code. The task of the program is to send the data of array 'a' from gpu0's memory into gpu1's memory. I took the following route to do so, which involves four steps:

After initializing array 'a' on gpu0,

  • step1: send data from gpu0 to cpu0 (using the !$acc update self() clause)
  • step2: send data from cpu0 to cpu1 (using MPI_SEND())
  • step3: receive data into cpu1 from cpu0 (using MPI_RECV())
  • step4: update gpu1's device memory (using the !$acc update device() clause)

This works perfectly fine, but it looks like a very long route and I think there is a better way of doing it. I tried to read up on the !$acc host_data use_device directive suggested in the following post, but was not able to implement it:

Getting started with OpenACC + MPI Fortran program

I would like to know how !$acc host_data use_device can be used to perform the task shown below in an efficient manner.

PROGRAM TOY_MPI_OpenACC
    
    implicit none
    
    include 'mpif.h'
    
    integer :: rank, nprocs, ierr, i, dest_rank, tag, from
    integer :: status(MPI_STATUS_SIZE)
    integer, parameter :: N = 10000
    double precision, dimension(N) :: a
    
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
    
    print*, 'Process ', rank, ' of', nprocs, ' is alive'
    
    !$acc data create(a)
    
        ! initialize 'a' on gpu0 (not cpu0)
        IF (rank == 0) THEN
            !$acc parallel loop default(present)
            DO i = 1,N
                a(i) = 1
            ENDDO
        ENDIF
        
        ! step1: send data from gpu0 to cpu0
        !$acc update self(a)
        
        print*, 'a in rank', rank, ' before communication is ', a(N/2)
        
        
        IF (rank == 0) THEN
            
            ! step2: send from cpu0
            dest_rank = 1;      tag = 1999
            call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
            
        ELSEIF (rank == 1) THEN
            
            ! step3: receive into cpu1
            from = MPI_ANY_SOURCE;      tag = MPI_ANY_TAG;  
            call MPI_RECV(a, N, MPI_DOUBLE_PRECISION, from, tag, MPI_COMM_WORLD, status, ierr)
            
            ! step4: send data into gpu1 from cpu1
            !$acc update device(a)
        ENDIF
        
        call MPI_BARRIER(MPI_COMM_WORLD, ierr)
        
        
        print*, 'a in rank', rank, ' after communication is ', a(N/2)
    
    !$acc end data
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
END

compilation: mpif90 -acc -ta=tesla toycode.f90 (mpif90 from nvidia hpc-sdk 21.9)

execution : mpirun -np 2 ./a.out

1 Answer


Here's an example. Note that I also added some boiler-plate code to do the local node rank to device assignment. I also prefer to use unstructured data regions since they're better suited to more complex codes, though here they are semantically equivalent to the structured data region you used above. I have guarded the host_data constructs behind a CUDA_AWARE_MPI macro since not all MPI builds have CUDA-Aware support enabled. Without it, you need to fall back to copying the data between the host and device before/after the MPI calls.
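To illustrate the remark about data regions, here is a minimal sketch (not part of the answer's code; the program name and arrays are placeholders) contrasting the structured form used in the question with the unstructured form used below. Both keep the array resident on the device for the marked lifetime:

PROGRAM DATA_REGION_SKETCH
    implicit none
    integer, parameter :: N = 10
    integer :: i
    double precision :: a(N), b(N)

    ! structured data region (question's style): device lifetime is the block
    !$acc data create(a)
    !$acc parallel loop default(present)
    DO i = 1,N
        a(i) = i
    ENDDO
    !$acc update self(a)
    !$acc end data

    ! unstructured data region (answer's style): the enter/exit points can
    ! sit in different routines, which scales better for larger codes
    !$acc enter data create(b)
    !$acc parallel loop default(present)
    DO i = 1,N
        b(i) = i
    ENDDO
    !$acc update self(b)
    !$acc exit data delete(b)

    print*, 'a(N) = ', a(N), ' b(N) = ', b(N)
END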

% cat mpi_acc.F90
PROGRAM TOY_MPI_OpenACC
    use mpi
#ifdef _OPENACC
    use openacc
#endif
    implicit none

    integer :: rank, nprocs, ierr, i, dest_rank, tag, from
    integer :: status(MPI_STATUS_SIZE)
    integer, parameter :: N = 10000
    double precision, dimension(N) :: a
#ifdef _OPENACC
    integer :: dev, devNum, local_rank, local_comm
    integer(acc_device_kind) :: devtype
#endif

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
    print*, 'Process ', rank, ' of', nprocs, ' is alive'

#ifdef _OPENACC
! set the MPI rank to device mapping
! 1) Get the local node's rank number
! 2) Get the number of devices on the node
! 3) Round-Robin assignment of rank to device
     call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
          MPI_INFO_NULL, local_comm,ierr)
     call MPI_Comm_rank(local_comm, local_rank,ierr)
     devtype = acc_get_device_type()
     devNum = acc_get_num_devices(devtype)
     dev = mod(local_rank,devNum)
     call acc_set_device_num(dev, devtype)
     print*, "Process ",rank," Using device ",dev
#endif

    a = 0
    !$acc enter data copyin(a)

        ! initialize 'a' on gpu0 (not cpu0)
        IF (rank == 0) THEN
            !$acc parallel loop default(present)
            DO i = 1,N
                a(i) = 1
            ENDDO
            !$acc update self(a)
        ENDIF

        ! step1: send data from gpu0 to cpu0
        print*, 'a in rank', rank, ' before communication is ', a(N/2)

        IF (rank == 0) THEN

            ! step2: send from cpu0
            dest_rank = 1;      tag = 1999
#ifdef CUDA_AWARE_MPI
            !$acc host_data use_device(a)
#endif
            call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
#ifdef CUDA_AWARE_MPI
            !$acc end host_data
#endif

        ELSEIF (rank == 1) THEN

            ! step3: receive into cpu1
            from = MPI_ANY_SOURCE;      tag = MPI_ANY_TAG;
#ifdef CUDA_AWARE_MPI
            !$acc host_data use_device(a)
#endif
            call MPI_RECV(a, N, MPI_DOUBLE_PRECISION, from, tag, MPI_COMM_WORLD, status, ierr)
#ifdef CUDA_AWARE_MPI
            !$acc end host_data
#else
            ! step4: send data into gpu1 from cpu1
            !$acc update device(a)
#endif
        ENDIF

        call MPI_BARRIER(MPI_COMM_WORLD, ierr)

       !$acc update self(a)
        print*, 'a in rank', rank, ' after communication is ', a(N/2)

    !$acc exit data delete(a)
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
END

% which mpif90
/proj/nv/Linux_x86_64/21.9/comm_libs/mpi/bin//mpif90
% mpif90 -V

nvfortran 21.9-0 64-bit target on x86-64 Linux -tp skylake
NVIDIA Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
% mpif90 -acc -Minfo=accel mpi_acc.F90
toy_mpi_openacc:
     38, Generating enter data copyin(a(:))
     42, Generating Tesla code
         43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     42, Generating default present(a(:))
     46, Generating update self(a(:))
     76, Generating update device(a(:))
     82, Generating update self(a(:))
     85, Generating exit data delete(a(:))
% mpirun -np 2 ./a.out
 Process             1  of            2  is alive
 Process             0  of            2  is alive
 Process             0  Using device             0
 Process             1  Using device             1
 a in rank            1  before communication is     0.000000000000000
 a in rank            0  before communication is     1.000000000000000
 a in rank            0  after communication is     1.000000000000000
 a in rank            1  after communication is     1.000000000000000
% mpif90 -acc -Minfo=accel mpi_acc.F90 -DCUDA_AWARE_MPI=1
toy_mpi_openacc:
     38, Generating enter data copyin(a(:))
     42, Generating Tesla code
         43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     42, Generating default present(a(:))
     46, Generating update self(a(:))
     82, Generating update self(a(:))
     85, Generating exit data delete(a(:))
% mpirun -np 2 ./a.out
 Process             0  of            2  is alive
 Process             1  of            2  is alive
 Process             1  Using device             1
 Process             0  Using device             0
 a in rank            1  before communication is     0.000000000000000
 a in rank            0  before communication is     1.000000000000000
 a in rank            1  after communication is     1.000000000000000
 a in rank            0  after communication is     1.000000000000000
Mat Colgrove
  • Just want to clarify the following: by using !$acc host_data use_device(a) around the MPI_SEND and MPI_RECV calls, the data is sent directly from gpu0 to gpu1 (avoiding the 4-step process stated in the question), since that directive makes the calls use the memory address corresponding to the device. Am I correct? – Dumbledore Albus Dec 07 '21 at 01:05
    "host_data" states that the device address of a variable is used on the host within the region. So it's just passing in the device address of "a" to the MPI calls. CUDA Aware enabled MPI can then detect that the address is from a device and use GPU Direct RDMA to pass the memory directly between GPUs, both within a node and between nodes. – Mat Colgrove Dec 07 '21 at 16:48