
I'm currently sifting through a lot of material on distributed training for neural networks (training with backpropagation). The more I dig into this material, the more it appears to me that essentially every distributed neural network training algorithm is just a way to combine gradients produced by distributed nodes (typically by averaging them), subject to the constraints of the execution environment (i.e. network topology, node performance equality, ...).

And all the salt of the underlying algorithms is concentrated in exploiting assumptions about those execution environment constraints, with the aim of reducing the overall lag and thus the overall amount of time necessary to complete the training.

So if distributed training just combines gradients by averaging in some clever way, then the whole training process is (more or less) equivalent to averaging the networks that result from training within every distributed node.

If I'm right about the things described above, then I would like to try combining the weights produced by distributed nodes by hand.

So my question is: how do you produce an average of two or more sets of neural network weights using any mainstream framework such as TensorFlow / Caffe / MXNet / ...?

Thank you in advance

EDIT @Matias Valdenegro

Matias, I understand what you are saying: you mean that as soon as you apply a gradient, the next gradient changes, and thus parallelization is not possible because the old gradients have no relation to the new, updated weights. So real-world algorithms evaluate the gradients, average them, and then apply them.

Now if you just expand the parentheses in this mathematical operation, you will notice that you can apply the gradients locally. Essentially there is no difference between averaging the deltas (vectors) and averaging the NN states (points). Please refer to the example below:


Suppose that NN weights are a 2-D vector.

Initial state  = (0, 0)
Deltas 1       = (1, 1)
Deltas 2       = (1,-1)
-----------------------
Average deltas = (1, 1) * 0.5 + (1, -1) * 0.5 = (1, 0)
NN State       = (0, 0) - (1, 0) = (-1, 0)

The same result can be achieved if the gradients are applied locally on each node and the central node averages the weights instead of the deltas:

--------- Central node 0 ---------
Initial state  = (0, 0)
----------------------------------

------------- Node 1 -------------
Deltas 1       = (1, 1)
State 1        = (0, 0) - (1,  1) = (-1, -1)
----------------------------------

------------- Node 2 -------------
Deltas 2       = (1,-1)
State 2        = (0, 0) - (1, -1) = (-1,  1)
----------------------------------

--------- Central node 0 ---------
Average state  = ((-1, -1) * 0.5 + (-1,  1) * 0.5) = (-1, 0)
----------------------------------

So the results are the same...
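
As a quick sanity check of this single-step equivalence, here is a small NumPy sketch using the same toy numbers as above (the arrays and the single update are purely illustrative):

import numpy as np

w0 = np.array([0.0, 0.0])           # shared initial weights
deltas = [np.array([1.0, 1.0]),     # delta produced on node 1
          np.array([1.0, -1.0])]    # delta produced on node 2

# Option A: average the deltas, then apply them once on the central node
state_a = w0 - np.mean(deltas, axis=0)

# Option B: apply each delta locally, then average the resulting weights
state_b = np.mean([w0 - d for d in deltas], axis=0)

print(state_a, state_b)             # both print [-1.  0.]
assert np.allclose(state_a, state_b)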

Lu4
  • No, averaging of gradients is not the same as averaging of the weights. – Dr. Snoopy Jun 30 '19 at 11:10
  • @MatiasValdenegro, I agree that averaging of gradients is not the same as averaging of weights, but the result of applying the averaged gradients is the same as averaging every network. Suppose that on every node we start training with the same set of weights; then each of the nodes produces a set of deltas. There is no difference between first averaging the deltas and applying them to the initial network, and applying them to every network and then averaging. Or am I missing something? – Lu4 Jun 30 '19 at 13:31
  • Gradient application cannot be parallelized; it's exactly the same thing I said before: if you apply gradients you modify the weights, and you can't average the weights. – Dr. Snoopy Jun 30 '19 at 14:28
  • @MatiasValdenegro I agree with what you say; I'm referring to something else, however. Please review my updated question. – Lu4 Jun 30 '19 at 15:17

1 Answer


The question in the title is different from the question in the body :) I'll answer both:

Title question: "Does distributed training produce NN that is average of NNs trained within each distributed node?"

No. In the context of model training with mini-batch SGD, distributed training usually refers to data-parallel distributed training, which distributes the computation of the gradients of a mini-batch of records over N workers and then produces an average gradient used to update the central model weights, in an async or sync fashion. Historically, the averaging happened in a separate process called the parameter server (the historical default in MXNet and TensorFlow), but modern approaches use a more network-frugal, peer-to-peer, ring-style all-reduce, democratized by Uber's Horovod extension, initially developed for TensorFlow but now available for Keras, PyTorch and MXNet too.

Note that model-parallel distributed training (hosting different pieces of a model on different devices) also exists, but data-parallel training is more common in practice, possibly because it is simpler to implement (distributing an average is easy) and because full models often fit comfortably in the memory of modern hardware. However, model-parallel training is occasionally seen for very large models, such as Google's GNMT.
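
To give a feel for what this looks like in code, here is a rough sketch along the lines of Horovod's Keras examples (the build_model() helper and train_dataset are hypothetical placeholders, and details vary with the Horovod and TensorFlow versions):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                           # one process per worker / GPU

model = build_model()                # hypothetical: same architecture on every worker
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)  # wraps the optimizer so gradients are
                                     # averaged across workers via all-reduce
model.compile(loss="categorical_crossentropy", optimizer=opt)

model.fit(
    train_dataset,                   # hypothetical: each worker reads its own shard
    epochs=5,
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],  # sync initial weights
    verbose=1 if hvd.rank() == 0 else 0,
)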

Body question: "How do you produce an average of two or more neural network weights using any mainstream technology?"

This depends on each framework's API; for example:

In TensorFlow: Tensorflow - Averaging model weights from restored models
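
For instance, with Keras models this can be a minimal layer-by-layer average via get_weights() / set_weights() (a sketch assuming hypothetical model1, model2 and model_avg that share the same architecture):

# average each layer's weight arrays and load them into a third model
avg_weights = [0.5 * (w1 + w2)
               for w1, w2 in zip(model1.get_weights(), model2.get_weights())]
model_avg.set_weights(avg_weights)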

In PyTorch: How to take the average of the weights of two networks?
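
Similarly, a minimal PyTorch sketch could average the state_dict entries (assuming hypothetical net1, net2 and net_avg that are instances of the same module, with floating-point parameters and buffers):

# average the two state dicts key by key and load the result into a third model
sd1, sd2 = net1.state_dict(), net2.state_dict()
avg_state = {k: 0.5 * (sd1[k] + sd2[k]) for k in sd1}
net_avg.load_state_dict(avg_state)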

In MXNet (dummy code assuming initialized gluon nn.Sequential() models with the same architecture):

# collect the ParameterDict holding each model's parameters
p1 = net1.collect_params()
p2 = net2.collect_params()
p3 = net3.collect_params()

# overwrite net3's parameters with the element-wise average of net1 and net2
for k1, k2, k3 in zip(p1, p2, p3):
    p3[k3].set_data(0.5*(p1[k1].data() + p2[k2].data()))
Olivier Cruchant
  • Thank you for your definitive answer. @Olivier_Cruchant I couldn't help noting that even though you don't agree with the statement `distributed training produces NN that is average of NNs trained within each distributed node`, you confirm it further in the text. Uber's Horovod and the parameter server just yield the average gradient of all minibatches for data- or model-parallel training, i.e. they just produce an average gradient. Applying the average gradient to the original NN is equivalent to computing the average of NNs trained on the same set of minibatches used to compute the gradient. – Lu4 Jul 02 '19 at 06:47
  • You're welcome! What you describe (model ensembling) may be equivalent to data-parallel distributed SGD for one single update with a constant learning rate, but it will likely not be equivalent if multiple updates are applied or advanced optimizers are used. – Olivier Cruchant Jul 02 '19 at 07:23