I'm running a parallel application that runs properly until it suddenly aborts with the following messages from a couple of cores:
[n18:mpi_rank_91][handle_cqe] Send desc error in msg to 103, wc_opcode=0
[n18:mpi_rank_91][handle_cqe] Msg from 103: wc.status=12, wc.wr_id=0xbc8d140, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[n18:mpi_rank_91][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 12, vendor code=0x81, dest rank=103 : Numerical result out of range (34)
I'm new to using and debugging MPI. My search did not lead to a definite conclusion (e.g., https://software.intel.com/en-us/node/535587). What do the above messages refer to? How can I find a bug in a parallel (Fortran) code from such a message?
A follow-up question: if my application has a heavy inner block in which some of the nodes spend a growing amount of time, how long can the nodes that have finished their task wait for the slower ones at the interface before InfiniBand congestion is reached?