
I am running on a CRAY supercomputer using the MPICH2 library. Each node has 32 CPUs.

I have a single float on each of N MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation over this group of floats. I would like to know whether MPI_Reduce is faster than MPI_Gather followed by the reduction on the root, for every value of N. Please assume that the reduction on the root rank uses a good parallel reduction algorithm that can utilize N threads.

If it isn't faster for every value of N, does it tend to be faster for smaller N, say 16, or for larger N?

If it is true, why? (For example, does MPI_Reduce use a tree communication pattern that tends to hide the reduction operation's cost inside the communication with the next level of the tree?)
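
For concreteness, here is a sketch of the two alternatives I am comparing (MPI_SUM and root rank 0 are just placeholder choices; the root-side loop is written serially here, but assume it would really be the threaded parallel reduction described above):

```c
/* The two alternatives in question; MPI_SUM and root = rank 0
 * are placeholder choices for illustration. */
#include <mpi.h>
#include <stdlib.h>

/* Alternative 1: MPI performs the reduction internally. */
float via_reduce(float my_val, MPI_Comm comm) {
    float result = 0.0f;
    MPI_Reduce(&my_val, &result, 1, MPI_FLOAT, MPI_SUM, 0, comm);
    return result;  /* meaningful on the root only */
}

/* Alternative 2: gather all N floats on the root, then reduce there.
 * The loop is serial here; assume a threaded parallel reduction. */
float via_gather(float my_val, int rank, int n, MPI_Comm comm) {
    float *all = NULL;
    float sum = 0.0f;
    if (rank == 0)
        all = malloc(n * sizeof(float));
    MPI_Gather(&my_val, 1, MPI_FLOAT, all, 1, MPI_FLOAT, 0, comm);
    if (rank == 0) {
        for (int i = 0; i < n; ++i)
            sum += all[i];
        free(all);
    }
    return sum;  /* meaningful on the root only */
}
```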

wiowou

1 Answer


You can safely assume that MPI_Reduce is always at least as fast as MPI_Gather + local reduce.

Even if there were a value of N for which the reduction was slower than the gather, an MPI implementation could easily implement MPI_Reduce for that case in terms of MPI_Gather + local reduce.

MPI_Reduce has only advantages over MPI_Gather + local reduce:

  1. MPI_Reduce is the more high-level operation, giving the implementation more opportunity to optimize.
  2. MPI_Reduce needs to allocate much less memory (for a scalar reduction the root keeps one partial result instead of N gathered inputs).
  3. MPI_Reduce needs to communicate less data (if using a tree) or less data over the same link (if using direct all-to-one).
  4. MPI_Reduce can distribute the computation across more resources, e.g. using a tree communication pattern (see the sketch after this list).
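
To illustrate point 4, here is a minimal sketch of a single-float reduction over a binomial tree. This is not the actual algorithm of any particular MPI implementation, just a demonstration of how a tree pattern spreads both the communication and the reduction work across ranks:

```c
/* Minimal sketch of a binomial-tree reduction for a single float,
 * illustrating how a tree spreads the work across ranks
 * (not the actual algorithm used by MPICH2 or any other MPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float value = (float)rank;  /* one float per rank */

    /* In each round, ranks whose current bit is set send their
     * partial sum to a partner and drop out; the rest keep reducing. */
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {
            MPI_Send(&value, 1, MPI_FLOAT, rank - step, 0, MPI_COMM_WORLD);
            break;  /* this rank's work is done */
        } else if (rank + step < size) {
            float incoming;
            MPI_Recv(&incoming, 1, MPI_FLOAT, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += incoming;  /* the reduction op, applied in the tree */
        }
    }
    if (rank == 0)
        printf("tree sum = %f\n", value);

    MPI_Finalize();
    return 0;
}
```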

That said: Never assume anything about performance. Measure.
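
A minimal timing sketch for such a measurement could look like the following (the iteration count is arbitrary, and the serial root-side loop stands in for the threaded reduction the question assumes; a real benchmark would also add warm-up rounds):

```c
/* Rough timing sketch: MPI_Reduce vs. MPI_Gather + local reduce,
 * one float per rank. Not a rigorous benchmark. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iters = 1000;           /* arbitrary repetition count */
    float my_val = (float)rank, result = 0.0f;
    float *gathered = malloc(size * sizeof(float));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i)
        MPI_Reduce(&my_val, &result, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    double t_reduce = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        MPI_Gather(&my_val, 1, MPI_FLOAT, gathered, 1, MPI_FLOAT,
                   0, MPI_COMM_WORLD);
        if (rank == 0) {              /* serial stand-in for the
                                         threaded root-side reduction */
            float sum = 0.0f;
            for (int j = 0; j < size; ++j)
                sum += gathered[j];
            result = sum;
        }
    }
    double t_gather = MPI_Wtime() - t0;

    if (rank == 0)
        printf("reduce: %g s, gather+local: %g s (%d iterations)\n",
               t_reduce, t_gather, iters);

    free(gathered);
    MPI_Finalize();
    return 0;
}
```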

Zulan
  • Great answer. I'd just add an old paper that proposes some general principles behind MPI performance and measures them: "Self-consistent MPI performance guidelines", JL Traff, WD Gropp, R Thakur, https://pdfs.semanticscholar.org/0dfc/b8b616adea2151ed87337008b1f973ee54c9.pdf – ang mo Apr 25 '18 at 13:04
  • @angmo great find! At first I thought rule (24) would contradict my assumption - however I think `n` means total size of the gathered array as well as size of data for each reduction - so it does make sense. I would therefore add `MPI_Reduce(n) <= MPI_Gather(n/p)`. – Zulan Apr 25 '18 at 14:06
  • indeed, (24) is phrased in a tricky way (and I wasn't fully aware of this until your comment, thanks!). They say "A consecutive block of data can be gathered from each process by summing contributions of size n_i, with all processes except i contributing blocks of n_i zeroes" - so it's like they compare gather with a very particular reduction... – ang mo Apr 25 '18 at 14:49
  • Thank you for that clear and logically consistent answer, @Zulan! Of course, as you indicate, measure to confirm. Also, thank you @ang mo for the link to that paper. Very helpful. – wiowou Apr 25 '18 at 23:29
  • @Zulan: After reading rule (24) more closely, it seems that MPI_Gather(n) <= MPI_Reduce(n), because I can always pad the elements on process i with the neutral element and perform an MPI_Reduce(n) instead of an MPI_Gather(n) if MPI_Reduce(n) were faster. Example: rank0=[1,2], rank1=[4,7], using MPI_Reduce with MPI_SUM. If I read in the values so that rank0=[1,2,0,0] and rank1=[0,0,4,7], MPI_Reduce would give me [1,2,4,7] on rank0 faster than MPI_Gather would, if MPI_Reduce(n) <= MPI_Gather(n). So it must be that MPI_Gather(n) <= MPI_Reduce(n) (see the runnable sketch at the end of this thread). – wiowou Apr 26 '18 at 00:33
  • @Zulan, would an MPI implementation actually do a parallel reduction on a single rank, per the 2nd sentence of your answer? Is MPI even aware of which ranks are actually cores of a multiprocessor and which are on separate CPUs? Also, regarding bullet 1: yes, MPI_Reduce is more high-level, but it is a more general case of MPI_Gather, per my explanation above. – wiowou Apr 26 '18 at 00:54
  • @wiowou, yes that's exactly how I read rule (24). But again, for your case the rule is `MPI_Reduce(n/p) <= MPI_Gather(n)` (I swapped the numbers before), or even more precisely `MPI_Reduce(1) <= MPI_Gather(p)`. – Zulan Apr 26 '18 at 07:37
  • MPI implementations are aware of which ranks share a node, so they can avoid the network stack when communicating locally. I'm not sure whether the reduction is actually done locally in practice - that's where the "don't assume, measure" part comes in. Practically, there's also the aspect that it may not matter for your application. To satisfy your curiosity: many common MPI implementations are open source, so you can look it up :-). – Zulan Apr 26 '18 at 07:42
  • @Zulan, thank you once again! It is good to know that MPI is aware of local CPUs. I will check, but I think Cray MPI may not be open source. – wiowou Apr 27 '18 at 11:26
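
For reference, here is the padding argument from the comment thread above as a runnable sketch (run with exactly 2 ranks; the values follow the rank0=[1,2], rank1=[4,7] example, and 0 is the neutral element of MPI_SUM):

```c
/* Runnable sketch of the padding argument: emulating MPI_Gather(n)
 * with MPI_Reduce(n) by padding each rank's block with zeroes.
 * Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns 2 elements; the full gathered array has 4. */
    float padded[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    if (rank == 0) { padded[0] = 1.0f; padded[1] = 2.0f; }
    if (rank == 1) { padded[2] = 4.0f; padded[3] = 7.0f; }

    float result[4];
    /* Summing the padded buffers yields exactly the gathered array
     * [1,2,4,7] on rank 0 - so a self-consistent implementation must
     * not let MPI_Reduce(n) beat MPI_Gather(n). */
    MPI_Reduce(padded, result, 4, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("[%g, %g, %g, %g]\n",
               result[0], result[1], result[2], result[3]);

    MPI_Finalize();
    return 0;
}
```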