
I've been falling in love with the ease of use of Fortran's coarray framework, because of how clean it is compared to lower-level APIs like MPI.

But one thing I haven't been able to tease out is whether there is a way to explicitly tell Fortran to perform puts and gets asynchronously. The benefit would be to replicate MPI's MPI_I* calls, which allow overlapping communication with computation.

I'm interested in overlapping for performance reasons. The particular application I have in mind is CFD with particle methods, where the domain is subdivided and halo particles are exchanged every time step. Using MPI point-to-point calls, which I'm currently more familiar with, I initiate the exchange of particle information between processes and then perform computation while the communications are completing, kind of like:

! own-rank slots are never posted, so start with null requests
request = MPI_REQUEST_NULL

do pid = 0, numprocs-1

   if (pid /= procid) then
      ! post sends
      call MPI_ISEND(neighbours(pid+1)%sendbuff, &
                     neighbours(pid+1)%n_send, &
                     particle_derived_type, &
                     pid, &
                     0, &
                     MPI_COMM_WORLD, &
                     request(pid+1), &
                     ierr)

      ! post receives
      call MPI_IRECV(neighbours(pid+1)%recvbuff, &
                     neighbours(pid+1)%n_recv, &
                     particle_derived_type, &
pid, &
                     0, &
                     MPI_COMM_WORLD, &
                     request(numprocs+pid+1), &
                     ierr)
   end if
end do

! do some heavy computation

call MPI_WAITALL(2*numprocs, request, MPI_STATUSES_IGNORE, ierr)

This is just for demonstration. In reality, each process would only communicate with its neighbouring processes, not all of them. The advantage of using MPI_ISEND/MPI_IRECV here is that I don't have to worry about deadlock, and that I can do some computation while the sends and receives are being completed.

A kind of equivalent example using Coarrays:

do pid = 1, numprocs

   if (pid /= this_image()) then
      ! put data into remote neighbour images
      n_send = neighbours(pid)%n_send
      neighbours(this_image())[pid]%recv_buff(1:n_send) = &
           neighbours(pid)%send_buff(1:n_send)
   end if

end do

! do some heavy computation

sync all

which is cool, because it's much more compact. But I'm not sure whether the "puts" return after merely initiating the transfer, the way MPI_ISEND/MPI_IRECV do.

So, for this example, I'm interested in replicating the MPI_I* calls' ability to overlap communication with computation in Fortran coarrays, as that overlap is pretty important in optimising the performance of CFD simulations.

EDIT: Hopefully a clearer explanation of why I want to overlap communication with computation.

  • So that we can understand your problem formally, and explain equally, can you state your concerns with reference to images and segments? – francescalus Mar 30 '23 at 22:49
  • Is there a reason you are comparing the coarray model with `MPI_ISend` and not `MPI_Put`? – francescalus Mar 30 '23 at 23:04
  • @francescalus I used the MPI_ISEND in the example since that's what I'm more familiar with. It seems that MPI_GET and MPI_PUT are both non-blocking. Do you know if this is also true of equivalent operations in Coarrays? I'm not sure I understand what you mean by images and segments though. I guess I just want to know if there are non-blocking equivalents to `co_*` calls and put/get operations. – Edward Yang Mar 31 '23 at 01:03
  • Segments are a fundamental building block of the execution model of a Fortran program. Trying to understand them by relating them to point-to-point MPI communication is likely to be challenging. Collective (coarray) subroutines are essentially by definition blocking, but something like `x[2]=5` is more subtle. Definition and referencing of coarrays are subject to restrictions based on segment ordering, and it's going to be hard for me to write a clear answer in this format (rather than a tutorial) without assuming a solid foundation in coarrays rather than p2p MPI. – francescalus Mar 31 '23 at 02:26
  • One of the important use cases for MPI_ISend and MPI_IRecv is the ability to execute them *both at the same rank*, without the if and *without a deadlock*. In fact executing many of them. And then just waiting for all the requests. Coarrays work differently, with segments and syncs, rather than with blocking and non-blocking message passing. – Vladimir F Героям слава Mar 31 '23 at 05:51
  • @francescalus thanks for the explanation. I'm interested in the `MPI_I*` calls' ability to return immediately, not for segment-ordering reasons. My CFD program is simple enough to ensure image segments are ordered with `sync all`. However, I'm interested in `MPI_I*`'s ability to return immediately because computation can be done while waiting for comms to complete - a fairly common strategy to optimise distributed-memory CFD simulations. I've updated the question with a contrived example that hopefully demonstrates what I'm trying to do. – Edward Yang Apr 03 '23 at 10:34
  • @VladimirFГероямслава thanks for the explanation. As I mention in my response to francescalus above, I'm not necessarily interested in avoiding deadlocks, but I'm more interested in trying to decrease the time spent waiting for comms as a proportion of total run time. – Edward Yang Apr 03 '23 at 10:38

1 Answer


The coarray communication model is one of remote memory access/one-sided communication, not one of point-to-point.

In the assignment statement in

integer i
i = 3
print *, i

end program

i is set to 3 "immediately". The reference in the print statement happens similarly.

One doesn't question whether the "put" and "get" happen with blocking, synchronously or asynchronously.

Consider now

integer, volatile :: i
i = 3
print *, i

end program

In the first example the processor may decide against storing to/reading from a permanent memory location for the value of i. In the second example, the value of i must be fetched rather than assumed.

When there is more than one image involved, we see something similar:[1]

integer i[*]

if (this_image()==1) then
  i = 1
  i[2] = 3
end if

sync all

print *, i

end program

Here, image 1 has two assignments, setting the value of i on each of the two images. Both happen "immediately".

As soon as i=1 is executed, the value of i on the first image is 1. As soon as i[2]=3 is executed, the value of i on the second image is 3.

Now, "blocking" in this second assignment (in particular) comes down to what it means for the assignment to complete.

There are two extremes of conversations that may be had:

Image 1: Hey, Image 2, you there?

Image 2: Sup?

Image 1: I'd like to set your value of i to be equal to 3.

Image 2: I'll get on it right after I've finished what I'm doing.

Image 1: No worries, I'll grab a coffee while you do it.

... time passes

Image 2: Wotcha, Image 1.

Image 1: Hello?

Image 2: I've done what you wanted. My i is now equal to 3.

Image 1: Great, thanks. I'll get back to work.

can be compared with

Image 1: Image 2, your value of i is now 3.

The Fortran standard does not say which of those conversations happens, but it leaves it open for the second to be the one that does. This second conversation does not even have to happen around the time of the assignment statement (Fortran 2018, 11.6.2 Note 4):[2]

In practice [..] the processor could make a copy of a nonvolatile coarray on an image [..] and, as an optimization, defer copying a changed value back to the permanent memory location while it is still being used. Since the variable is not volatile, it is safe to defer this transfer [..]

That's all to say that the assignment could be blocking in some way but there's no requirement or incentive for it to be. Importantly, deadlock cannot occur in assignment even if all images are trying to assign to coarrays on all other images.
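To see that claim in code, here is a rough sketch, assuming it generalises to any number of images (the inbox coarray and its name are illustrative only, not anything standard):

program all_to_all_puts
  implicit none
  integer :: img
  integer, allocatable :: inbox(:)[:]

  allocate (inbox(num_images())[*])   ! allocating a coarray synchronises all images
  inbox = 0
  sync all    ! order the initialisation before any remote puts arrive

  ! every image assigns to a slot on every other image; none of these
  ! assignments can deadlock, whatever order the images reach them in
  do img = 1, num_images()
     if (img /= this_image()) inbox(this_image())[img] = this_image()
  end do

  sync all    ! after this segment boundary the puts are complete and visible

  print *, 'image', this_image(), 'received', inbox

end program all_to_all_puts

Every image executes the same put loop; no pairing of sends with receives is needed for the program to make progress.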

One-sided communication like this works by placing restrictions on interactions between communication partners.

Loosely, if one image sets the value of a (non-volatile) coarray, no other image is allowed to define or reference that coarray until some synchronization has happened.[3] The image that sets the value is assured that until this synchronization, the coarray is exactly as this image has decided it to be.

Communication can happen exactly at the time of the assignment, or at some later time; computation can stall until the value has been transmitted, even acknowledged, or continue immediately. Fortran doesn't tell us which, but a processor is allowed to defer or even eliminate communication, or to overlap communication and computation. Compiler vendors are usually keen to have optimal behaviour.
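As a sketch of how the question's halo exchange might be structured to give the processor that chance: issue the puts first, do halo-independent work, and delay the segment boundary until the remote data is actually needed. (The neighbour and buffer names below follow the question's example; compute_interior_particles and compute_halo_particles are hypothetical placeholders.)

do pid = 1, num_images()
   if (pid /= this_image()) then
      n_send = neighbours(pid)%n_send
      ! issue the puts up front; the processor may start the transfers
      ! here, defer them, or overlap them with the work below
      neighbours(this_image())[pid]%recv_buff(1:n_send) = &
           neighbours(pid)%send_buff(1:n_send)
   end if
end do

call compute_interior_particles()   ! hypothetical: work needing no halo data

sync all   ! segment boundary delayed until the remote data is required

call compute_halo_particles()       ! hypothetical: now safe to use recv_buff

Whether the transfers genuinely overlap the interior computation remains the implementation's choice, but this shape at least permits it.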

The program

integer i[*]
i[2] = this_image()
print *, i[2]

end program

is not valid. The restrictions which make this type of program invalid allow for a range of implementation approaches. As a programmer you have little say in or knowledge of which implementation is used.
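For contrast, a minimal valid reworking in the style of the earlier examples, where exactly one image defines i[2] and any reference to it is ordered into a later segment:

integer i[*]

if (this_image() == 1) i[2] = this_image()   ! exactly one image defines i[2]
sync all                                     ! definition ordered before any reference
if (this_image() == 2) print *, i            ! reference in a later segment

end program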

I've said nothing about atomic actions (including events) or collective actions.


[1] Examples that follow assume two images.

[2] If Image 1 makes several assignments to i[2], the image may choose to give just the final value to Image 2 in one conversation. Indeed, the second conversation can be avoided completely in some cases.

[3] It's this restriction which allows us to say that the value of a coarray is affected immediately: a valid Fortran program cannot have a conflict about what the value should be.

francescalus