
I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy.

From the documentation and from investigating the source code, I found that Horovod (https://github.com/horovod/horovod) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses MPI's allreduce and allgather operations.

From my observation (I may be wrong), MirroredStrategy also uses an all-reduce algorithm (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

Both of them use a data-parallel, synchronous training approach, so I am a bit confused about how they differ. Is the difference only in the implementation, or are there other (theoretical) differences?

And how does the performance of MirroredStrategy compare to Horovod?

Md Kamruzzaman Sarker
    Take a look at https://www.logicalclocks.com/goodbye-horovod-hello-tensorflow-collectiveallreduce/ – Sharky Mar 18 '19 at 17:00

2 Answers


MirroredStrategy has its own all-reduce algorithm, which uses remote procedure calls (gRPC) under the hood.
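
For context, here is a minimal sketch of how MirroredStrategy is typically used with Keras (the model and dataset are placeholders, not from the question):

```python
import tensorflow as tf

# Sketch only: MirroredStrategy keeps a copy of every variable on each
# local GPU and all-reduces the gradients across replicas at every step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on all GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Keras splits each global batch across the replicas; the gradient
# all-reduce is handled by the strategy's cross-device ops.
# model.fit(train_dataset, epochs=5)  # train_dataset is a placeholder
```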

As you mentioned, Horovod uses MPI or Gloo to communicate between multiple processes.
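
For comparison, here is a minimal sketch following the usage pattern from Horovod's Keras documentation (again, the model and data are placeholders):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # sets up the MPI/Gloo communicator

# Pin each process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the
# optimizer so gradients are averaged with allreduce at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Broadcast initial variables from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, epochs=5, callbacks=callbacks)

# Launched as one process per GPU, e.g.:
#   horovodrun -np 4 python train.py
```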

Ashiq Imran

Regarding the performance: one of my colleagues performed experiments using 4 Tesla V100 GPUs with the code from here. The results suggested that three settings work best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
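
If it helps to reproduce this, those settings correspond to command-line flags in TensorFlow's tf_cnn_benchmarks script (assuming that is the benchmark used; the flag names are from that script, and the model and values here are illustrative):

```
# replicated variables with NCCL all-reduce
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 \
    --variable_update=replicated --all_reduce_spec=nccl

# collective all-reduce with a tuned merge scope
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 \
    --variable_update=collective_all_reduce --allreduce_merge_scope=32

# Horovod, launched as one process per GPU
mpirun -np 4 python tf_cnn_benchmarks.py --model=resnet50 \
    --variable_update=horovod
```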

Minh Nguyen