
My application uses one-sided communication (MPI_Rget, MPI_Raccumulate) with MPI_Win_lock and MPI_Win_unlock for passive-target synchronization.

I profiled my application and found that most of the time is spent in MPI_Win_unlock (not MPI_Win_lock), and I cannot understand why.

(1) Does anyone know why MPI_Win_unlock takes so much time? (Maybe it's an implementation issue.) (2) Would the situation improve if I moved to the S/C/P/W synchronization model? I just need to be sure that the one-sided operations do not overlap concurrently.

I am using the Intel MPI Library version 5.1, which implements MPI-3.

I have appended some snippets of my code (actually, that's all of it :D).

Each MPI process runs 'Run()'

Run ()
 // Join
 For each Target_Proc i in MPI_COMM_WORLD
  RequestDataFrom ( (i + k) % nprocs ); // requests k-step away neighbor's data asynchronously
  ConsumeDataFrom (i); 
  JoinWithMyData (my_rank, i);
  WriteBackDataTo (i);

 Repeat the above 'For' loop if the termination condition does not hold.
 MPI_Barrier(MPI_COMM_WORLD);

 // Update Data in Window
 UpdateMyWindow (my_rank);

RequestDataFrom (target_rank_id)
 MPI_Win_Lock (MPI_LOCK_SHARED, target_rank_id, win)
 MPI_Rget (from target_rank_id, win, &requests[target_rank_id])
 MPI_Win_Unlock (target_rank_id, win)

ConsumeDataFrom (target_rank_id)
 MPI_Wait (&requests[target_rank_id])
 GetPointerToBuffer (target_rank_id)

WriteBackDataTo (target_rank_id)
 MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
 MPI_Rput (to target_rank_id, win, &requests[target_rank_id])
 MPI_Win_Unlock (target_rank_id, win)

UpdateMyWindow (my_rank)
 MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, my_rank, win)
 Update()
 MPI_Win_Unlock (my_rank, win)
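
For reference, below is a minimal C sketch of the passive-target pattern above; the buffer names, element counts, displacements, and datatypes are illustrative assumptions, not taken from the actual application.

#include <mpi.h>

void RequestDataFrom(int target_rank_id, MPI_Win win,
                     double *recv_buf, int count, MPI_Request *requests)
{
    /* Shared lock: several origins may read the target window concurrently. */
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank_id, 0, win);
    /* Request-based get: this call only dispatches the transfer. */
    MPI_Rget(recv_buf, count, MPI_DOUBLE,
             target_rank_id, 0, count, MPI_DOUBLE,
             win, &requests[target_rank_id]);
    /* Closing the epoch blocks until the transfer has completed. */
    MPI_Win_unlock(target_rank_id, win);
}

void ConsumeDataFrom(int target_rank_id, MPI_Request *requests)
{
    /* The epoch is already closed above, so this wait returns immediately. */
    MPI_Wait(&requests[target_rank_id], MPI_STATUS_IGNORE);
    /* ... use the data in the local buffer ... */
}

void WriteBackDataTo(int target_rank_id, MPI_Win win,
                     const double *send_buf, int count, MPI_Request *requests)
{
    /* Exclusive lock: no other process may access the target window meanwhile. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank_id, 0, win);
    MPI_Rput(send_buf, count, MPI_DOUBLE,
             target_rank_id, 0, count, MPI_DOUBLE,
             win, &requests[target_rank_id]);
    MPI_Win_unlock(target_rank_id, win);
}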
syko
  • Please share code. I can help, but not without MCVE. – Jeff Hammond Jan 26 '16 at 02:04
  • Try again with async progress enabled. The env var should be easy to find via Google. – Jeff Hammond Jan 26 '16 at 02:05
  • @Jeff, thanks Jeff, I added some snippets. Setting 'MPICH_ASYNC_PROGRESS=1' does a good job here; it reduces the time spent in 'MPI_Win_unlock' by 50%. Now I am becoming very suspicious about the term 'one-sided communication' in the MPI standard... – syko Jan 26 '16 at 02:27
  • MPI standard refuses to guarantee asynchrony. I have fought this but it will likely remain a quality-of-implementation issue. See Casper project from Argonne for how to do async progress more efficiently than with threads. Full disclosure: I'm a co-author and the primary user of Casper right now. – Jeff Hammond Jan 26 '16 at 03:16
  • Btw you should not use Rget if you sync with Unlock immediately thereafter. Use Get for this. Use of Rget is limited to dataflow-type usage. – Jeff Hammond Jan 26 '16 at 03:19
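
Following the last comment, a minimal sketch (buffer name, count, and datatype are illustrative) of using a plain MPI_Get when the epoch is closed immediately afterwards, so no request object is needed:

void RequestDataFrom(int target_rank_id, MPI_Win win, double *recv_buf, int count)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank_id, 0, win);
    /* Plain get, no MPI_Request: MPI_Win_unlock already guarantees completion. */
    MPI_Get(recv_buf, count, MPI_DOUBLE,
            target_rank_id, 0, count, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank_id, win);   /* data is usable once this returns */
}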

1 Answer


The function MPI_Win_unlock will block until all RMA operations of the access epoch have been completed.

As such, it is no surprise that your profiler shows this function taking the majority of the time: it blocks until the MPI implementation has completed all one-sided communication operations that were posted since the corresponding MPI_Win_lock.

Note that one-sided operations (Put, Get, etc.) merely dispatch the operation and do not block until it is completed. As such, these operations are effectively very similar to the non-blocking communication functions (MPI_Isend/MPI_Irecv), just without the MPI_Request object. To continue the analogy, MPI_Win_unlock waits for all operations to complete, similar to an MPI_Waitall.
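
As an illustration of that analogy (the function, buffer, and counts below are made up for the example), each get only dispatches work and the unlock is the single completion point:

/* Illustrative only: 'win' is assumed to expose at least nchunks*chunk doubles
   on target_rank. */
void fetch_all_chunks(MPI_Win win, int target_rank, double *local_buf,
                      int nchunks, int chunk)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    for (int i = 0; i < nchunks; ++i)
        /* Dispatch only -- nothing is guaranteed to have arrived yet. */
        MPI_Get(&local_buf[i * chunk], chunk, MPI_DOUBLE,
                target_rank, (MPI_Aint)i * chunk, chunk, MPI_DOUBLE, win);
    /* Blocks until every get above has completed, much like MPI_Waitall
       completing a batch of MPI_Irecv requests. */
    MPI_Win_unlock(target_rank, win);
}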

Patrick
  • Thanks for your explanation; that explains the numbers. Does that mean I am forcing synchronization too conservatively? Would it improve if I used the S/C/P/W synchronization primitives? – syko Jan 26 '16 at 01:53
  • Whether or not S/C/P/W synchronization will be faster depends a lot on your application. Active target synchronization with `MPI_Win_fence`, which allows all accesses in between fences (see the sketch after these comments), may be the easiest option. If you require stricter synchronization, S/C/P/W should be quicker if the communication pattern remains simple (e.g. each process communicates with a few neighbors only). If each process may communicate with many others, `lock`/`unlock` may be faster. Only testing the different options will allow you to identify the best performing one for your application ;) – Patrick Jan 26 '16 at 02:02
  • No. Do not use PSCW. If that's your pattern, use Send-Recv. – Jeff Hammond Jan 26 '16 at 02:04
  • Is PSCW really that bad? Would you say that Send/Recv is almost always faster than PSCW? – Patrick Jan 26 '16 at 02:06
  • PSCW might be useful with accumulate ops but I don't know anyone trying to use this... – Jeff Hammond Jan 26 '16 at 03:22
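
A minimal sketch of the MPI_Win_fence (active-target) alternative mentioned in the comments above; the function and buffer names are illustrative assumptions:

/* Collective: every process in the window's group must call both fences. */
void exchange_with_fence(MPI_Win win, int target_rank, double *recv_buf, int count)
{
    MPI_Win_fence(0, win);                 /* opens an epoch on every rank          */
    MPI_Get(recv_buf, count, MPI_DOUBLE,
            target_rank, 0, count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* completes all RMA issued in the epoch */
}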