
I'm running a parallel application and it runs properly until it suddenly aborts with the following message from a couple of cores:

[n18:mpi_rank_91][handle_cqe] Send desc error in msg to 103, wc_opcode=0
[n18:mpi_rank_91][handle_cqe] Msg from 103: wc.status=12, wc.wr_id=0xbc8d140, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[n18:mpi_rank_91][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 12, vendor code=0x81, dest rank=103 : Numerical result out of range (34)

I'm new to MPI usage/debugging. My search didn't come up with a definite conclusion (e.g., https://software.intel.com/en-us/node/535587). What are the above messages referring to? How do I find a bug in a parallel (Fortran) code given such a message?
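For what it's worth, assuming the `wc.status` that MVAPICH2 prints is the raw libibverbs completion status, the numeric code can at least be translated into its symbolic name with a few lines of C (status 12 comes out as IBV_WC_RETRY_EXC_ERR, i.e. the transport retry counter was exceeded):

```c
/* decode_wc_status.c -- assumes wc.status is a raw ibv_wc_status value
 * build with: gcc decode_wc_status.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int status = 12;   /* the wc.status value from the abort message above */
    printf("wc.status=%d -> %s\n", status,
           ibv_wc_status_str((enum ibv_wc_status) status));
    return 0;
}
```

If I read that correctly, it means the HCA gave up after retransmitting to rank 103 without getting an acknowledgement, but it doesn't tell me why.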

A follow-up question: if my application has a heavy inner block in which some of the nodes spend a growing amount of time, how long can the nodes that have finished their part wait for the slower ones at the interface before InfiniBand congestion sets in?
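To put some numbers on that, I'm planning to time the heavy block and the wait at the synchronization point on each rank. A minimal C sketch of what I mean (my real code is Fortran, and heavy_block() below is only a hypothetical stand-in for the actual computation):

```c
/* imbalance_probe.c -- sketch for measuring per-rank load imbalance
 * build with: mpicc imbalance_probe.c -o imbalance_probe */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* stand-in for the real heavy inner block: deliberately slower on some ranks */
static void heavy_block(int rank) { sleep(rank % 4); }

int main(int argc, char **argv)
{
    int rank;
    double t0, t_work, t_wait, work_min, work_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    heavy_block(rank);               /* the imbalanced part */
    t_work = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);     /* fast ranks idle here */
    t_wait = MPI_Wtime() - t0;

    MPI_Reduce(&t_work, &work_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_work, &work_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("work time: min %.2f s, max %.2f s, spread %.2f s\n",
               work_min, work_max, work_max - work_min);
    printf("rank %d waited %.2f s at the barrier\n", rank, t_wait);

    MPI_Finalize();
    return 0;
}
```

The question then is how large that spread can get before the fabric starts to complain.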

Jacob
    It could be an indication of a hardware problem with the underlying InfiniBand network. Try to increase the timeout by setting `MV2_DEFAULT_TIME_OUT` to `20` or something. – Hristo Iliev Oct 22 '15 at 06:44
  • @HristoIliev - Thanks for pointing this out; however, setting `MV2_DEFAULT_TIME_OUT` to `20` or `23` didn't work for me -- I get the same error. The interesting point is that with no optimization (i.e., -O0) this error is somehow eliminated. One would think that if the reason were InfiniBand congestion (e.g., http://users.sdsc.edu/~glockwood/comp/faq.php), optimization would be in its favour. Any other thoughts at the Fortran code level, rather than the underlying network? – Jacob Oct 22 '15 at 15:27
  • Sorry, I'm out of ideas. Also, I'm an Open MPI user and have a very limited knowledge of MVAPICH2. – Hristo Iliev Oct 22 '15 at 17:24
  • This is almost certainly a bug in your code. The fact that turning off optimization makes the problem go away is usually a sure sign of this. However, without seeing your code, there's really no way to know what the specific problem is. – NoseKnowsAll Oct 22 '15 at 21:17
  • Are you by any chance using non-blocking communication operations? – Hristo Iliev Oct 24 '15 at 09:04
  • @NoseKnowsAll and Hristo Iliev - Thanks for your replies; I'm using the WRF model, which as far as I know does not use non-blocking communication. Nevertheless, the MPI wrapper sits in the main model, while I'm referring to inner modules, so my hand-coded module does not contain any MPI operations (i.e., calls). As for the optimization issue, I think you are right; I had only checked this briefly and, to make a long story short, it is not the case -- I get the above error at any optimization level. – Jacob Oct 25 '15 at 10:34

0 Answers