I'm currently sifting through a ton of material on distributed training for neural networks (training with backpropagation). The more I dig into this material, the more it appears to me that essentially every distributed neural network training algorithm is just a way to combine the gradients produced by the distributed nodes (typically by averaging them), subject to constraints on the execution environment (i.e. network topology, node performance equality, ...).
The essence of the underlying algorithms lies in exploiting assumptions about those execution environment constraints, with the aim of reducing the overall lag and thus the overall time needed to complete the training.
So if distributed training just combines gradients by averaging them in some clever way, then the whole training process is (more or less) equivalent to averaging the networks produced by training on each distributed node.
If I'm right about the above, then I would like to try combining the weights produced by distributed nodes by hand.
So my question is: how do you produce an average of two or more sets of neural network weights using any mainstream technology such as TensorFlow / Caffe / MXNet / ...?
Thank you in advance
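For concreteness, here's a minimal sketch of what I mean by averaging, assuming the framework exposes each model's weights as a list of NumPy arrays (as Keras's `get_weights()` / `set_weights()` do); the arrays below are made-up placeholders standing in for two nodes' trained weights:

```python
import numpy as np

# Placeholder weight lists, as returned by e.g. model.get_weights()
# on two nodes that trained the same architecture.
weights_node1 = [np.array([[1.0, 2.0]]), np.array([0.5])]
weights_node2 = [np.array([[3.0, 4.0]]), np.array([1.5])]

# Elementwise average, layer by layer.
averaged = [(w1 + w2) / 2.0 for w1, w2 in zip(weights_node1, weights_node2)]

# The result could then be loaded back with model.set_weights(averaged).
```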
EDIT @Matias Valdenegro
Matias, I understand what you are saying: as soon as you apply a gradient, the next gradient changes, so parallelization seems impossible because the old gradients have no relation to the newly updated weights. That's why real-world algorithms evaluate the gradients, average them, and only then apply them.
But if you just expand the parentheses in this mathematical operation, you'll notice that you can apply the gradients locally. Essentially there's no difference between averaging the deltas (vectors) and averaging the NN states (points). Please refer to the diagram below:
Suppose that NN weights are a 2-D vector.
Initial state = (0, 0)
Deltas 1 = (1, 1)
Deltas 2 = (1, -1)
-----------------------
Average deltas = (1, 1) * 0.5 + (1, -1) * 0.5 = (1, 0)
NN State = (0, 0) - (1, 0) = (-1, 0)
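In NumPy, the central-averaging version above is just:

```python
import numpy as np

initial = np.array([0.0, 0.0])
delta1 = np.array([1.0, 1.0])
delta2 = np.array([1.0, -1.0])

avg_delta = 0.5 * delta1 + 0.5 * delta2  # (1, 0)
state = initial - avg_delta              # (-1, 0)
```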
Now the same result can be achieved if the gradients are applied locally on each node and the central node averages the weights instead of the deltas:
--------- Central node 0 ---------
Initial state = (0, 0)
----------------------------------
------------- Node 1 -------------
Deltas 1 = (1, 1)
State 1 = (0, 0) - (1, 1) = (-1, -1)
----------------------------------
------------- Node 2 -------------
Deltas 2 = (1, -1)
State 2 = (0, 0) - (1, -1) = (-1, 1)
----------------------------------
--------- Central node 0 ---------
Average state = ((-1, -1) * 0.5 + (-1, 1) * 0.5) = (-1, 0)
----------------------------------
So the results are the same...
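The equivalence is just linearity of the average; a quick NumPy check with the same toy numbers confirms that both orderings produce the same state:

```python
import numpy as np

initial = np.array([0.0, 0.0])
deltas = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]

# Central node: average the deltas, then apply once.
central = initial - np.mean(deltas, axis=0)

# Per-node: apply each delta locally, then average the resulting states.
local = np.mean([initial - d for d in deltas], axis=0)

assert np.allclose(central, local)  # both equal (-1, 0)
```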