There are a lot of things going on under the covers here, but one of the biggest bottlenecks in InfiniBand is the QP cache in the firmware.
The firmware has a very small QP cache (on the order of 16-32 entries, depending on which adapter you are using). Once the number of active QPs exceeds this cache, the benefits of using IB start to evaporate.
From what I know, the performance penalty for a cache miss is on the order of milliseconds.. yes, that's right, milliseconds.
There are many other caches involved.
IB has multiple transports, the two most common being:
1. RC - Reliable Connected
2. UD - Unreliable Datagram
Reliable Connected mode is somewhat like TCP in that it requires an explicit connection and is point-to-point between two processes. Each process allocates a QP (Queue Pair), which is roughly analogous to a socket in the Ethernet world.
But a QP is a much more expensive resource than a socket, for many different reasons.
UD: Unreliable Datagram mode is like UDP in that it does not need a connection. A single UD QP can talk to any number of remote UD QPs.
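To make the RC/UD distinction concrete, here is a minimal libibverbs sketch (error handling mostly omitted) that creates one QP of each type. It assumes a working rdma-core install and at least one IB device; the point is simply that the transport is baked in at creation time via qp_type:

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

    /* The transport is fixed when the QP is created: IBV_QPT_RC vs IBV_QPT_UD. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* point-to-point, reliable; one per peer */
    };
    struct ibv_qp *rc_qp = ibv_create_qp(pd, &attr);

    attr.qp_type = IBV_QPT_UD;   /* connectionless; one QP can reach many peers */
    struct ibv_qp *ud_qp = ibv_create_qp(pd, &attr);

    printf("RC QP num: 0x%x, UD QP num: 0x%x\n", rc_qp->qp_num, ud_qp->qp_num);

    ibv_destroy_qp(ud_qp);
    ibv_destroy_qp(rc_qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Before either QP can move traffic you still have to walk it through the INIT/RTR/RTS state transitions and exchange QP numbers out of band, and for RC you do that per peer, which is exactly the per-peer setup and per-peer QP cost that does not scale.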
If your data model is one-to-many, i.e. one machine talking to many machines, and you need reliable delivery of large messages, then you are out of luck: IB starts losing some of its effectiveness, because one RC QP per peer quickly blows past that firmware QP cache.
If you have the resources to build a reliability layer on top, then use UD for scalability.
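The reason UD scales is that one QP plus one cheap address handle per destination replaces one full RC QP per destination. A rough fan-out sketch, assuming pd, ud_qp, and a registered buffer (mr/buf) already exist, and that the hypothetical peers[] table (LID/QPN/QKey per destination) was filled in by some out-of-band exchange:

```c
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* One UD QP sending to many remote UD QPs, one ibv_ah per destination. */
struct peer { uint16_t lid; uint32_t qpn; uint32_t qkey; struct ibv_ah *ah; };

static int send_to_all(struct ibv_pd *pd, struct ibv_qp *ud_qp,
                       struct ibv_mr *mr, void *buf, size_t len,
                       struct peer *peers, int npeers, uint8_t port)
{
    for (int i = 0; i < npeers; i++) {
        if (!peers[i].ah) {                 /* create the address handle once, then reuse */
            struct ibv_ah_attr ah_attr = {
                .dlid     = peers[i].lid,
                .port_num = port,
            };
            peers[i].ah = ibv_create_ah(pd, &ah_attr);
            if (!peers[i].ah)
                return -1;
        }

        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list = &sge, .num_sge = 1,
            .opcode  = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,  /* completions must be reaped from the send CQ */
            .wr.ud = { .ah = peers[i].ah,
                       .remote_qpn  = peers[i].qpn,
                       .remote_qkey = peers[i].qkey },
        }, *bad;
        if (ibv_post_send(ud_qp, &wr, &bad)) /* no delivery guarantee: this is UD */
            return -1;
    }
    return 0;
}
```

The catch is that a UD message is limited to the path MTU (at most 4 KB), so the reliability layer you build on top also has to handle fragmentation and reassembly, not just retransmission.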
If your data model is one-to-many but the many remote processes reside on the same machine, then you can use RDS (Reliable Datagram Sockets), which is a socket interface to InfiniBand that multiplexes many connections over a single RC connection between two machines. (RDS has its own set of weird issues, but it's a start.)
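Since RDS is exposed as an ordinary socket family, using it looks like plain datagram socket code. A minimal sketch, assuming the rds/rds_rdma kernel modules are loaded and using placeholder IPoIB addresses and arbitrary port numbers:

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef AF_RDS
#define AF_RDS 21            /* value from linux/socket.h; older libc headers omit it */
#endif

int main(void)
{
    int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
    if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

    /* RDS sockets must be bound to a local IP/port before sending. */
    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port   = htons(18634) };
    inet_pton(AF_INET, "192.0.2.10", &local.sin_addr);   /* placeholder local IPoIB address */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); return 1;
    }

    /* Datagrams to any number of peers ride the same per-node RC connection underneath. */
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(18635) };
    inet_pton(AF_INET, "192.0.2.20", &peer.sin_addr);    /* placeholder remote address */

    const char msg[] = "hello over RDS";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&peer, sizeof(peer)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}
```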
There is a third, newer transport called XRC (eXtended Reliable Connected) which mitigates some of these scalability issues as well, but it has its own caveats.