
I am using RDMA writes in my application and want to improve throughput.

Currently, a single thread uses my queue pair (QP). Which of these is the more standard approach, and what are the advantages of each?

  1. Creating more connections to the remote node (so multiple queue pairs) and load balancing my traffic across them
  2. Having multiple threads call ibv_post_send on the single QP

Thank you!

Mihir Shah

1 Answer


All libibverbs APIs are thread-safe, so having multiple threads post to a single QP is not a safety issue. That said, the concurrent posts have to be serialized somewhere along the stack, and that synchronization may cost more than the extra threads gain you.

In general, having one QP per core should perform better. Multiple QPs also let the NIC extract parallelism, not just the CPU. It's hard to make a blanket statement across NICs and drivers, though, because QPs take up NIC SRAM and the amount available varies. That should only be a concern if you go for an extremely large number of QPs, not with 1 QP/core or some number in that range.
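To make the QP-per-core layout concrete, here is a minimal sketch in which each thread owns exactly one queue and posts to it without any locking. It is only an illustration: `fake_qp`, `fake_post_send`, `worker`, `run_workers` and `MAX_WORKERS` are hypothetical names, and a plain counter stands in for a real QP so that the code runs without RDMA hardware. In a real application each thread would hold its own `struct ibv_qp *` and call `ibv_post_send` on it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Stand-in for a queue pair (hypothetical). A real program would keep a
 * struct ibv_qp * here. The alignment keeps each stand-in on its own
 * cache line; 64 bytes is an assumed cache-line size. */
struct fake_qp {
    _Alignas(64) unsigned long posted;  /* work requests "posted" so far */
};

struct worker_ctx {
    struct fake_qp *qp;  /* QP owned exclusively by this thread */
    int num_posts;       /* how many posts this thread issues */
};

/* Hypothetical stand-in for ibv_post_send(). No lock is taken because
 * only the owning thread ever touches this QP. */
static void fake_post_send(struct fake_qp *qp)
{
    qp->posted++;
}

static void *worker(void *arg)
{
    struct worker_ctx *ctx = arg;
    for (int i = 0; i < ctx->num_posts; i++)
        fake_post_send(ctx->qp);
    return NULL;
}

/* Start one thread per QP, wait for all of them, and return the total
 * number of posts, or -1 on a bad argument or thread-creation failure. */
static long run_workers(struct fake_qp qps[], int n, int posts_each)
{
    pthread_t threads[MAX_WORKERS];
    struct worker_ctx ctx[MAX_WORKERS];
    long total = 0;
    int started = 0;

    assert(qps != NULL);
    if (n < 1 || n > MAX_WORKERS)
        return -1;

    for (int i = 0; i < n; i++) {
        qps[i].posted = 0;
        ctx[i].qp = &qps[i];
        ctx[i].num_posts = posts_each;
        if (pthread_create(&threads[i], NULL, worker, &ctx[i]) != 0)
            break;
        started++;
    }
    /* Join only the threads that were actually started. */
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += (long)qps[i].posted;
    }
    return started == n ? total : -1;
}
```

Calling `run_workers(qps, 4, 1000)` on an array of four `fake_qp` entries starts four threads that each post to their own queue. Because no queue is shared, there is nothing to synchronize on the post path, which is the point of this layout.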

There are other things you can consider to improve your application throughput:

  1. You can also reconsider your application design. Larger messages are much more efficient than smaller messages if you want to achieve line rate. Can you batch the data you're sending into larger buffers?

  2. If the communication thread is also doing some compute for each message, those are cycles diverted from communication. Can you separate the compute out into its own thread? The answer is not always yes: if your compute kernel is tiny enough, the cost of inter-thread synchronization can exceed the benefit of offloading it to a separate thread.
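For the batching suggestion in point 1, a rough sketch of the idea is to copy small messages into one larger staging buffer and only hand the buffer to the send path once it is full. Everything here is an assumption for illustration: `batcher`, `batcher_add`, `batcher_flush`, `count_flush` and the 4096-byte `BATCH_CAPACITY` are hypothetical, and the flush callback is where a real application would post one RDMA write for the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH_CAPACITY 4096  /* assumed batch size; tune for your workload */

/* Called with the coalesced bytes when a batch is sent (hypothetical hook). */
typedef void (*flush_fn)(const void *buf, size_t len, void *arg);

struct batcher {
    unsigned char buf[BATCH_CAPACITY];  /* staging buffer for small messages */
    size_t used;                        /* bytes currently staged */
    flush_fn flush;                     /* sends one large message */
    void *flush_arg;
};

static void batcher_init(struct batcher *b, flush_fn flush, void *arg)
{
    b->used = 0;
    b->flush = flush;
    b->flush_arg = arg;
}

/* Send whatever is staged as a single large message; no-op if empty. */
static void batcher_flush(struct batcher *b)
{
    if (b->used == 0)
        return;
    b->flush(b->buf, b->used, b->flush_arg);
    b->used = 0;
}

/* Stage one small message, flushing first if it would not fit.
 * Returns 0 on success, -1 if the message can never fit in a batch. */
static int batcher_add(struct batcher *b, const void *msg, size_t len)
{
    if (len > BATCH_CAPACITY)
        return -1;
    if (b->used + len > BATCH_CAPACITY)
        batcher_flush(b);
    memcpy(b->buf + b->used, msg, len);
    b->used += len;
    return 0;
}

/* Example callback that only counts what would have been sent. */
struct flush_stats {
    size_t calls;
    size_t bytes;
};

static void count_flush(const void *buf, size_t len, void *arg)
{
    struct flush_stats *st = arg;
    (void)buf;
    st->calls++;
    st->bytes += len;
}
```

With 1024-byte messages, four calls to `batcher_add` fill the buffer and the fifth triggers a single 4096-byte flush instead of four separate sends. Two caveats for real RDMA use: the staging buffer has to live in registered memory, and it must not be overwritten until the previous send on it has completed, so in practice you would rotate between several buffers. The receiver also needs some way to split the batch back into individual messages.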

Ankush Jain
  • Since "anything involving a NIC" is probably mostly *"I/O-bound,"* I would not predict that "multiple (CPU ...) threads" would be worth pursuing. – Mike Robinson Aug 31 '22 at 18:07
  • @MikeRobinson If the OP is working with NICs that support RDMA-like features, these can be O(100 Gbps) and O(10Mpps). It's impossible to saturate their capabilities with a single thread, and to the extent it is possible with extreme amounts of optimization, it is much more practical to just throw more threads at the problem. – Ankush Jain Sep 01 '22 at 01:23