CRAY supercomputer using the MPICH2 library. Each node has 32 CPU's.
I have a single float on N different MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation on this group of floats. I would like to know whether an MPI_Reduce is faster than MPI_Gather with the reduction calculated on the root, for any value of N. Please assume that the reduction done on the root rank will be done using a good parallel reduction algorithm that can utilize N threads.
If it isn't faster for any value of N, would it tend to be true for smaller N, like 16, or larger N?
If it is true, why? (For example, will MPI_Reduce use a tree communication pattern that tends to hide the reduction operation's time in the approach it uses to communicate with the next level of the tree?)