
I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy.

From the documentation and from investigating the source code, I found that Horovod (https://github.com/horovod/horovod) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses MPI's allreduce and allgather operations.

From my observation (I may be wrong), MirroredStrategy also uses an all-reduce algorithm (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

Both of them use a data-parallel, synchronous training approach, so I am a bit confused about how they differ. Is the difference only in the implementation, or are there other (theoretical) differences?

And how does the performance of MirroredStrategy compare to Horovod?

Md Kamruzzaman Sarker
    Take a look at https://www.logicalclocks.com/goodbye-horovod-hello-tensorflow-collectiveallreduce/ – Sharky Mar 18 '19 at 17:00

2 Answers


MirroredStrategy has its own all-reduce algorithm, which uses remote procedure calls (gRPC) under the hood.
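
For context, here is a minimal sketch of how MirroredStrategy is typically used with Keras (the model and dataset are placeholders, not from the question):

```python
import tensorflow as tf

# Sketch only: MirroredStrategy keeps a copy of every variable on each
# local GPU and all-reduces the gradients across replicas at every step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on all GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Keras splits each global batch across the replicas; the gradient
# all-reduce is handled by the strategy's cross-device ops.
# model.fit(train_dataset, epochs=5)  # train_dataset is a placeholder
```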

As you mentioned, Horovod uses MPI or Gloo to communicate between multiple processes.
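
For comparison, here is a minimal sketch following the usage pattern from Horovod's Keras documentation (again, the model and data are placeholders):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # sets up the MPI/Gloo communicator

# Pin each process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the
# optimizer so gradients are averaged with allreduce at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Broadcast initial variables from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, epochs=5, callbacks=callbacks)

# Launched as one process per GPU, e.g.:
#   horovodrun -np 4 python train.py
```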

Ashiq Imran

Regarding the performance: one of my colleagues performed experiments using 4 Tesla V100 GPUs with the code from here. The results suggested that three settings work best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
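
If it helps to reproduce this, those settings correspond to command-line flags in TensorFlow's tf_cnn_benchmarks script (assuming that is the benchmark used; the flag names are from that script, and the model and values here are illustrative):

```
# replicated variables with NCCL all-reduce
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 \
    --variable_update=replicated --all_reduce_spec=nccl

# collective all-reduce with a tuned merge scope
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 \
    --variable_update=collective_all_reduce --allreduce_merge_scope=32

# Horovod, launched as one process per GPU
mpirun -np 4 python tf_cnn_benchmarks.py --model=resnet50 \
    --variable_update=horovod
```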

Minh Nguyen