I got confused about the two concepts, In-graph replication and Between-graph replication, when reading Replicated training in TensorFlow's official How-to. It is said in the above link that:
In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); ...
Does this mean there are multiple tf.Graphs in the Between-graph replication approach? If yes, where is the corresponding code in the provided examples?

While there is already a Between-graph replication example in the above link, could anyone provide an In-graph replication implementation (pseudo-code is fine) and highlight its main differences from Between-graph replication?
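For reference, here is my rough guess at what In-graph replication might look like, based only on my reading of the quoted paragraph (the host names and the tiny softmax model are placeholders I made up, not anything from the How-to):

import tensorflow as tf

# Single client, single tf.Graph: parameters live on /job:ps, and the client
# builds one compute "tower" per worker task inside the same graph.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})

with tf.device("/job:ps/task:0"):
    # One set of parameters, shared by every replica.
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")

tower_losses = []
for i in range(2):  # one tower per /job:worker task
    with tf.device("/job:worker/task:%d" % i):
        x = tf.placeholder(tf.float32, [None, 784])
        y_ = tf.placeholder(tf.float32, [None, 10])
        logits = tf.matmul(x, w) + b
        tower_losses.append(tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)))

loss = tf.add_n(tower_losses) / 2.0
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# The single client would then drive the whole graph through one session,
# e.g. tf.Session("grpc://worker0.example.com:2222").

Is this roughly the right idea?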
Thanks in advance!
Edit_1: more questions
Thanks a lot for your detailed explanations and gist code @mrry @YaroslavBulatov! After looking at your responses, I have the following two questions:
There is the following statement in Replicated training:
Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
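To make sure I understand that paragraph, here is a minimal sketch of what I think each worker's client builds (the host names, the task_index constant and the tiny model are placeholders; in the real example the task index comes from a command-line flag):

import tensorflow as tf

# Run once per worker process, each time with a different task_index.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
task_index = 0  # would differ per worker

server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    # Variables are mapped to /job:ps; compute ops stay on the local worker.
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    w = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=y_, logits=tf.matmul(x, w) + b))
    global_step = tf.Variable(0)
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
        loss, global_step=global_step)

# Each client then runs its own training loop against server.target.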
I have two sub-questions related to the phrases "a similar graph" and "a single copy of the compute-intensive part of the model" above.
(A) Why do we say each client builds a similar graph, but not the same graph? It seems to me that the graph built in each client in the example of Replicated training should be the same, because the graph construction code below is shared among all workers:

# Build model...
loss = ...
global_step = tf.Variable(0)
(B) Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs? If not, can we simultaneously use both In-graph replication to support training on multiple GPUs within each machine and Between-graph replication for cross-machine training? I ask this question because @mrry indicated that In-graph replication is essentially the same as the approach used in the CIFAR-10 example model for multiple GPUs.
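For what it's worth, the combination I have in mind looks roughly like the sketch below, i.e. Between-graph replication across machines plus CIFAR-10-style towers over the local GPUs of each worker. I do not know whether the nested device scopes actually compose this way; that is essentially what I am asking. The host names, the number of GPUs and the model are again placeholders.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
task_index = 0   # per-worker flag, as before
num_gpus = 2     # GPUs available on this machine

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    w = tf.Variable(tf.zeros([784, 10]))  # intended to live on /job:ps
    b = tf.Variable(tf.zeros([10]))

    tower_losses = []
    for g in range(num_gpus):
        # One copy of the compute-intensive part per local GPU.
        with tf.device("/job:worker/task:%d/gpu:%d" % (task_index, g)):
            x = tf.placeholder(tf.float32, [None, 784])
            y_ = tf.placeholder(tf.float32, [None, 10])
            tower_losses.append(tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=y_, logits=tf.matmul(x, w) + b)))

    loss = tf.add_n(tower_losses) / num_gpus
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)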