My application uses one-sided communication (MPI_Rget, MPI_Raccumulate) with MPI_Win_Lock and MPI_Win_Unlock for passive target synchronization.
I profiled the application and found that most of the time is spent in MPI_Win_Unlock (not MPI_Win_Lock), and I cannot understand why.
(1) Does anyone know why MPI_Win_Unlock takes so much time? (Maybe it's an implementation issue.) (2) Would the situation improve if I moved to the post/start/complete/wait (PSCW) synchronization model? I just need to be sure that the one-sided operations do not overlap concurrently.
I am using Intel's MPI Library version 5.1, which implements MPI-3.
I have appended a sketch of my code below (actually, it's all of it :D).
Each MPI process runs 'Run()'
Run ()
    // Join
    For each Target_Proc i in MPI_COMM_WORLD
        RequestDataFrom ( (i + k) % nprocs ); // requests k-step away neighbor's data asynchronously
        ConsumeDataFrom (i);
        JoinWithMyData (my_rank, i);
        WriteBackDataTo (i);
    goto the above 'For loop' again if the termination condition does not hold.
    MPI_Barrier(MPI_COMM_WORLD);
    // Update Data in Window
    UpdateMyWindow (my_rank);

RequestDataFrom (target_rank_id)
    MPI_Win_Lock (MPI_LOCK_SHARED, target_rank_id, win)
    MPI_Rget (from target_rank_id, win, &requests[target_rank_id])
    MPI_Win_Unlock (target_rank_id, win)

ConsumeDataFrom (target_rank_id)
    MPI_Wait (&requests[target_rank_id])
    GetPointerToBuffer (target_rank_id)

WriteBackDataTo (target_rank_id)
    MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
    MPI_Rput (to target_rank_id, win, &requests[target_rank_id])
    MPI_Win_Unlock (target_rank_id, win)

UpdateMyWindow ()
    MPI_Win_Lock (MPI_LOCK_EXCLUSIVE, target_rank_id, win)
    Update()
    MPI_Win_Unlock (target_rank_id, win)