Could you explain what the "InfiniBand stacks" are? They were recently changed on our machine, and since then I have been running into MPI communication failures. I would like to understand how this change might be affecting the stability of my parallel jobs.
The actual error message I got was:
A process failed to create a queue pair. This usually means either the device has run out of queue pairs (too many connections) or there are insufficient resources available to allocate a queue pair (out of memory). The latter can happen if either 1) insufficient memory is available, or 2) no more physical memory can be registered with the device.
[connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect