
The MPI-3 standard introduces shared-memory windows, which all processes sharing the memory can read and write without calling the MPI library. While there are examples of one-sided communication using shared or non-shared memory, I did not find much information about how to use shared memory correctly with direct access.

I ended up doing something like the following, which works well, but does the MPI standard guarantee that it will always work?

// initialization: split off a communicator of processes that share memory
MPI_Comm comm_shared;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, i_mpi, MPI_INFO_NULL, &comm_shared);  // i_mpi: this process's rank, used as ordering key

// allocation
const int N_WIN = 10;
const int mem_size = 1000*1000;   // window size in bytes
double* mem[N_WIN];
MPI_Win win[N_WIN];
for (int i = 0; i < N_WIN; i++) {   // I need several buffers.
    MPI_Win_allocate_shared(mem_size, sizeof(double), MPI_INFO_NULL, comm_shared, &mem[i], &win[i]);
    MPI_Win_lock_all(0, win[i]);
}

while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

// deallocation
for (int i=0; i<N_WIN; i++) {
    MPI_Win_unlock_all(win[i]);
    MPI_Win_free(&win[i]);
}

Here, I ensure synchronization using MPI_Barrier() and assume the hardware keeps the memory view consistent. Furthermore, because I have several shared windows, a single call to MPI_Barrier() seems more efficient than calling MPI_Win_fence() on every shared-memory window.

It seems to work well on my x86 laptops and servers. But is this a valid/correct MPI program? And is there a more efficient method of achieving the same thing?

nat chouf

2 Answers


There are two key issues here:

  1. MPI_Barrier is absolutely not a memory barrier and should never be used as one. It may synchronize memory as a side effect of its implementation in most cases, but users can never assume that. MPI_Barrier is only guaranteed to synchronize process execution. (If it helps, you can imagine a system where MPI_Barrier is implemented with a hardware widget that does no more than the MPI standard requires. IBM Blue Gene did something like this in some cases.)
  2. This question is unanswerable without details on what you are actually doing with shared memory here:
while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

It may not be written clearly, but the authors of the relevant text of the MPI-3 standard (I was part of this group) assumed that one could reason about shared memory using the memory model of the underlying/host language. Thus, if you are writing this code in C11, you can reason about it according to the C11 memory model.

If you want to use MPI to synchronize shared memory, then you should use MPI_Win_sync on all the windows for load-store accesses and MPI_Win_flush for RMA operations (Put/Get/Accumulate/Get_accumulate/Fetch_and_op/Compare_and_swap).
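For the load-store case, a minimal sketch of this pattern, reusing win, N_WIN, and comm_shared from the question (this is my interpretation of the standard's advice, not a verbatim excerpt):

// Load/store synchronization with MPI_Win_sync (sketch; assumes the
// passive-target epoch opened by MPI_Win_lock_all is still in effect).
for (int i = 0; i < N_WIN; i++)
    MPI_Win_sync(win[i]);      // make this process's stores visible
MPI_Barrier(comm_shared);      // synchronizes process execution only
for (int i = 0; i < N_WIN; i++)
    MPI_Win_sync(win[i]);      // observe stores made by other processes
// ... now load the data written by other processes ...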

I expect MPI_Win_sync to be implemented as a CPU memory barrier, so calling it for every window is redundant. This is why it may be more efficient to assume the C11 or C++11 memory model and use https://en.cppreference.com/w/c/atomic/atomic_thread_fence or https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence, respectively.
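In C11, a sketch of that alternative (this assumes the C11 memory model governs the MPI shared-memory region, per the reasoning above):

#include <stdatomic.h>

// One fence replaces the per-window MPI_Win_sync calls (sketch).
atomic_thread_fence(memory_order_release);  // publish this process's stores
MPI_Barrier(comm_shared);                   // execution synchronization only
atomic_thread_fence(memory_order_acquire);  // order the loads that follow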

Jeff Hammond
  • Thank you very much for your answer. May I thus assume that, within a hybrid MPI-OpenMP program, something like `#pragma omp barrier`, followed by `MPI_Barrier(comm_shared);`, followed by another `#pragma omp barrier`, might do the trick? (If I understood correctly, `#pragma omp barrier` is also a memory barrier.) – nat chouf Mar 10 '20 at 16:59
  • `#pragma omp barrier` is primarily a thread execution barrier, but it implies a memory barrier (i.e. `#pragma omp flush`). While in practice `#pragma omp barrier` is sufficient, technically it only applies within the context of OpenMP. I know of no such case, but one could build a system where OpenMP would not synchronize interprocess load-store operations. I'm sorry to be difficult here, but I am an "HPC language lawyer" of sorts. – Jeff Hammond Mar 10 '20 at 19:36
  • Could you elaborate on the use of `atomic_thread_fence()`? Do you suggest I could use `MPI_Barrier()` together with `atomic_thread_fence()` to replace `MPI_Win_flush()`? If so, should I put the fence before or after the barrier? Or on both sides? – nat chouf Mar 11 '20 at 08:54
  • Flush works fine but is overkill. I doubt you’ll detect the difference in cost though. It’s relatively cheap to flush an empty RMA queue. – Jeff Hammond Mar 13 '20 at 04:32
  • Yeah, sorry, I meant `MPI_Win_sync()`. So should I put `atomic_thread_fence()` on both sides of the `MPI_Barrier()` to replace `MPI_Win_sync()`? – nat chouf Mar 13 '20 at 11:08
  • I would not replace `MPI_Win_sync` with `atomic_thread_fence` unless you are using C(++)1z atomics. – Jeff Hammond Mar 13 '20 at 16:11

I would be tempted to say this MPI program is not valid.

To explain what I base my opinion on:

  • In the description of MPI_Win_allocate_shared:

The consistency of load/store accesses from/to the shared memory as observed by the user program depends on the architecture. A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization functions (see Section 11.5) or explicitly completing outstanding store accesses (e.g., by calling MPI_WIN_FLUSH). MPI does not define semantics for accessing shared memory windows in the separate memory model.

  • Section 11.4, about the memory models, states:

In the RMA unified model, public and private copies are identical and updates via put or accumulate calls are eventually observed by load operations without additional RMA calls. A store access to a window is eventually visible to remote get or accumulate calls without additional RMA calls. These stronger semantics of the RMA unified model allow the user to omit some synchronization calls and potentially improve performance.

  • The advice to users that follows only indicates:

If accesses in the RMA unified model are not synchronized (with locks or flushes, see Section 11.5.3), load and store operations might observe changes to the memory while they are in progress.

  • Section 11.7, Semantics and Correctness, says:

MPI_BARRIER provides process synchronization, but not memory synchronization.

  • The various examples in Section 11.8 explain well how to use flush and sync operations.

The only synchronization ever addressed is one-sided synchronization, i.e. in your case MPI_Win_flush{,_all} or MPI_Win_unlock{,_all} (apart from the mutual exclusion of concurrent active and passive synchronization, which has to be enforced by the user, and the use of the MPI_MODE_NOCHECK assert flag).

So either you access the memory directly with store operations, in which case you need to call MPI_Win_sync() on each of your windows before calling MPI_Barrier (as explained in Example 11.10) to ensure synchronization; or you are doing RMA accesses, in which case you have to call at least MPI_Win_flush_all before the second barrier to ensure the operations have been propagated. If you then read using load operations, you may have to synchronize again after the second barrier before doing so.
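A sketch of the RMA variant, reusing the question's variables (my reading of the requirements above, not a verbatim example from the standard):

// RMA accesses instead of direct stores (sketch).
MPI_Barrier(comm_shared);
// ... MPI_Put / MPI_Accumulate on the shared windows ...
for (int i = 0; i < N_WIN; i++)
    MPI_Win_flush_all(win[i]);  // complete this process's pending RMA operations
MPI_Barrier(comm_shared);       // readers proceed only after the flush
// ... read the data written by other processes ...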

Another solution would be to unlock and re-lock between the barriers (as sketched below), or to use compiler- and hardware-specific constructs to ensure the load occurs after the data is updated.
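A sketch of the unlock/re-lock variant (though note the comment below arguing that flushing should be preferred over toggling the epoch):

// Toggling the passive-target epoch instead of flushing (sketch).
// ... write to shared memory ...
for (int i = 0; i < N_WIN; i++)
    MPI_Win_unlock_all(win[i]);   // close the epoch, completing all accesses
MPI_Barrier(comm_shared);
for (int i = 0; i < N_WIN; i++)
    MPI_Win_lock_all(0, win[i]);  // open a new epoch before reading
// ... read the data written by other processes ...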

Clément
  • Thank you for your answer. Looking at the documentation for MPI_Win_flush_all, it seems to be useful for RMA operations, which I thought were put, get, or accumulate calls. I'm not sure this applies to direct access in a shared memory window. I find the standard a bit vague about that... – nat chouf Feb 19 '20 at 21:47
  • From my understanding of what I read, you can do direct accesses, but then you would have to refer to Examples 11.7 and 11.9. You need to call `MPI_Win_sync` after the "before-reading" barrier, so your local view of the shared buffer is updated before reading, and to call `MPI_Win_sync` after all writes have been done, to update your "public copy" of the window. Or simply call `MPI_Win_unlock_all` before the barrier and `MPI_Win_lock_all` after. You may improve the performance with the right hints/asserts though (`MPI_MODE_NOCHECK`, for example). – Clément Feb 20 '20 at 10:01
  • One should never unlock and relock with MPI-3. Flush is equivalent to toggling an epoch. – Jeff Hammond Mar 06 '20 at 19:43
  • @Jeff I didn't know that. Why is that? Isn't the whole point of passive synchronization to allow the asynchronous lock-modification-unlock of remote memory? As for the flush toggling an epoch, it doesn't enforce the memory synchronization, does it? Or if so, what is the point of MPI_Win_sync? – Clément Mar 09 '20 at 14:19
  • `MPI_Win_flush` is specified to be equivalent to `MPI_Win_unlock; MPI_Win_lock`. Flush and unlock synchronize RMA operations, which include direct access. `MPI_Win_sync` synchronizes the public window (used for RMA) and the private window (used for direct access). In the unified memory model, one gets eventual consistency between these, but `MPI_Win_sync` makes it immediate. This is a *super* complicated topic and probably warrants a separate Q&A. But please read https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node289.htm and related. – Jeff Hammond Mar 10 '20 at 20:08
  • http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture35.pdf may be useful. That content is aligned with the understanding of the authors of the RMA chapter of MPI 3.0. – Jeff Hammond Mar 10 '20 at 20:08
  • My question ("Why is that?") was about the sequence `MPI_Win_unlock; MPI_Win_lock` being forbidden, which surprised me. – Clément Mar 12 '20 at 09:20
  • I understand that if a call to `MPI_Win_flush` is **strictly** equivalent to `MPI_Win_unlock; MPI_Win_lock`, then it does the memory synchronization; but the definition only specifies `MPI_Win_flush` as completing all pending RMA operations. `MPI_Win_sync`, however, would be the memory synchronization, managing direct memory access (load/store). – Clément Mar 12 '20 at 09:35
  • However, in semantic and correctness, in the user rationale about UM it says "In the unified memory model, in the case where the window is in shared memory, SYNC can be used to order store operations and make store updates to the window visible to other processes and threads. Use of this routine is necessary […] when point-to-point, collective, or shared memory synchronization is used in place of an RMA synchronization routine. SYNC should be called by the writer before the non-RMA synchronization operation and by the reader after the non-RMA synchronization, as shown in Example 11.21." – Clément Mar 12 '20 at 09:42