Fellow RDMA hackers, does anyone know whether rdma_get_recv_comp(), which calls __ibv_get_cq_event(), ever times out?
My problem is with the same programs as shown here: RDMA program randomly hangs
The code works fine, but it's not robust against random client disconnects. Specifically, if I forcefully kill the client, the server gets stuck in rdma_get_recv_comp() / ibv_get_cq_event().
This is for a Mellanox ConnectX-3 and I checked that the default timeout is 2.14s and retries = 1. But I'm not clear if ibv_get_cq_event() in blocking mode will even time out. The explanation of timeout in the ibv_modify_qp() documentation seems to suggest timeouts only apply for sends (rdma_get_send_comp()) since only senders wait for ACKs. But I don't see any difficulty in allowing receives to have a timeout too.
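For context, the ibv_modify_qp() documentation defines the local ACK timeout as 4.096 µs × 2^timeout, with a value of 0 meaning "wait indefinitely". Here is a minimal sketch of that arithmetic, assuming the 2.14 s figure above corresponds to an exponent of 19 (an inference from the number, not a value read back from the QP). ack_timeout_ns is a hypothetical helper, not a libibverbs function:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of libibverbs): convert the local ACK
 * timeout exponent used by ibv_modify_qp() into nanoseconds.
 */
static uint64_t ack_timeout_ns(unsigned int exponent)
{
    exponent &= 0x1f;            /* the timeout field is 5 bits wide */
    if (exponent == 0)
        return 0;                /* 0 means "wait forever"; reported as 0 here */
    return 4096ULL << exponent;  /* 4.096 us = 4096 ns, times 2^exponent */
}
```

With an exponent of 19 this gives 2,147,483,648 ns, which matches the roughly 2.14 s I observed. Note that this timer governs how long a sender waits for an ACK; it says nothing about how long a blocked ibv_get_cq_event() call waits.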
If hanging inside rdma_get_recv_comp() is expected in this case, how can I avoid it or implement a timeout?
Some possibilities:
Change my client shutdown sequence so that it completes all the necessary sends and doesn't leave rdma_get_recv_comp() hanging on the server?
Replace rdma_get_recv_comp() with a loop that polls for receive completions and gives up after a deadline?
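A variant of the second option that avoids busy-polling might be to wait on the completion channel's file descriptor with poll(2) before calling ibv_get_cq_event(). The sketch below shows only the timed wait on a generic file descriptor. wait_fd_readable is a hypothetical helper name, and I'm assuming the descriptor passed in would be id->recv_cq_channel->fd:

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * Note: a signal restarts the wait with the full timeout, so the total
 * wait can exceed timeout_ms if signals arrive.
 */
static int wait_fd_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);  /* retry if interrupted by a signal */

    if (ret < 0)
        return -1;                        /* poll itself failed */
    if (ret == 0)
        return 0;                         /* nothing arrived in time */
    return (pfd.revents & POLLIN) ? 1 : -1;  /* error/hangup without data */
}
```

In the server this would presumably slot in as follows: call ibv_poll_cq() on id->recv_cq; if it returns 0, arm the CQ with ibv_req_notify_cq(), poll once more, then call wait_fd_readable(id->recv_cq_channel->fd, timeout_ms). Only when that returns 1 would I call ibv_get_cq_event() followed by ibv_ack_cq_events(), so the blocking call never has to wait. On a 0 return the server can treat the client as gone and tear down the connection. I haven't verified this against a killed client yet.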