
I have a problem in this part of the code (which is common to all tasks):

for (i = 0; i < m; i++) {
    // some code
    MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    // some code
}

This works fine, but for large values of m I get this error:

    Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    
    job aborted:
    rank: node: exit code[: error message]
    0: AmirDiab: 1
    1: AmirDiab: 1
    2: AmirDiab: 1: Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    3: AmirDiab: 1

Any advice?

Amir Diab
  • `MPI_INT` is not a match for `bool` (use `MPI_CXX_BOOL` in C++, see https://stackoverflow.com/questions/57598517/how-to-send-boolean-datatype-through-mpi-c/57600337) – Gilles Gouaillardet Jun 22 '21 at 23:48
  • thanks for your comment, `bool` is actually `int`; I defined it with `typedef int bool;` in my code. Sorry if this was confusing, I'll edit the question :) – Amir Diab Jun 23 '21 at 06:43
  • The root cause can be a consequence of a previous memory corruption. Can you edit your question with a [mcve]? – Gilles Gouaillardet Jun 23 '21 at 06:52
  • What is the return value of MPI_Reduce? – kungjohan Jun 23 '21 at 07:33
  • I'm new to MPI and don't know how reduce actually works between tasks; do you mean I should avoid using `MPI_Reduce` inside loops? @GillesGouaillardet – Amir Diab Jun 23 '21 at 07:38
  • I mean you should write a minimal program that 1) evidences the issue and 2) can be compiled. A snippet is unfortunately not helpful here. – Gilles Gouaillardet Jun 23 '21 at 08:05
  • it returns `0` until the error occurs; after that I can't print the returned value @kungjohan – Amir Diab Jun 23 '21 at 08:07
  • note the error message `Failed to allocate memory for an unexpected message. 261895 unexpected messages queued`. That suggests either a memory leak in your program or an internal control-flow issue. A workaround worth trying is to `MPI_Barrier(...)` every nth iteration (`10` should work but with a performance penalty, `100` should be a bit faster if it works) – Gilles Gouaillardet Jun 23 '21 at 08:54
  • You are right. The error disappeared, but with the slowdown you mentioned. Thank you – Amir Diab Jun 24 '21 at 08:02

1 Answer


You seem to be overtaxing MPI with your communication pattern. Note the 261895 unexpected messages queued part of the error: that is a lot of messages. MPI sends small messages (like your single-element reductions) eagerly, i.e. without waiting for the receiver to post a matching receive, so running hundreds of thousands of MPI_Reduce calls in a loop can exhaust the memory backing the unexpected-message queue once some ranks run ahead of the others.

If possible, rearrange your algorithm so that all m elements are handled in a single reduction instead of one reduction per loop iteration:

int* res = malloc(m * sizeof(int)); // local values, one per iteration (as in your loop)
int* ms  = malloc(m * sizeof(int)); // receives the element-wise minima (significant on rank 0)

for (i = 0; i < m; ++i) {
    res[i] = /* ... */;
}

// one reduction over all m elements instead of m single-element reductions
MPI_Reduce(res, ms, m, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
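Note that with MPI_Reduce the receive buffer (ms here) is only significant on the root rank, and remember to free both buffers when you are done with them.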

Alternatively, as suggested in the comments, you can add MPI_Barrier() calls every so often inside the loop to limit the number of outstanding messages.
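A minimal sketch of that workaround, based on the loop from your question (the interval n is a tuning knob; per the comments, 10 should work and 100 should be a bit faster if it suffices):

int n = 100; // sync interval: smaller is safer, larger is faster

for (i = 0; i < m; i++) {
    // some code
    MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    // some code

    // periodically synchronize all ranks so no rank runs far ahead
    // and floods the others' unexpected-message queues
    if ((i + 1) % n == 0)
        MPI_Barrier(MPI_COMM_WORLD);
}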

dabo42