I got confused about the two concepts, In-graph replication and Between-graph replication, when reading Replicated training in TensorFlow's official How-to. It is said in the above link that:
In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); ...
Does this mean there are multiple tf.Graphs in the Between-graph replication approach? If yes, where is the corresponding code in the provided examples?

While there is already a Between-graph replication example in the above link, could anyone provide an In-graph replication implementation (pseudo-code is fine) and highlight its main differences from Between-graph replication?
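For reference, here is my rough guess at what In-graph replication might look like, based only on my reading of the quoted paragraph (the host names and the tiny softmax model are placeholders I made up, not anything from the How-to):

import tensorflow as tf

# Single client, single tf.Graph: parameters live on /job:ps, and the client
# builds one compute "tower" per worker task inside the same graph.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})

with tf.device("/job:ps/task:0"):
    # One set of parameters, shared by every replica.
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")

tower_losses = []
for i in range(2):  # one tower per /job:worker task
    with tf.device("/job:worker/task:%d" % i):
        x = tf.placeholder(tf.float32, [None, 784])
        y_ = tf.placeholder(tf.float32, [None, 10])
        logits = tf.matmul(x, w) + b
        tower_losses.append(tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)))

loss = tf.add_n(tower_losses) / 2.0
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# The single client would then drive the whole graph through one session,
# e.g. tf.Session("grpc://worker0.example.com:2222").

Is this roughly the right idea?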
Thanks in advance!
Edit_1: more questions
Thanks a lot for your detailed explanations and gist code @mrry @YaroslavBulatov! After looking at your responses, I have the following two questions:
There is the following statement in Replicated training:
Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
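To make sure I understand that paragraph, here is a minimal sketch of what I think each worker's client builds (the host names, the task_index constant and the tiny model are placeholders; in the real example the task index comes from a command-line flag):

import tensorflow as tf

# Run once per worker process, each time with a different task_index.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
task_index = 0  # would differ per worker

server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    # Variables are mapped to /job:ps; compute ops stay on the local worker.
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    w = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=y_, logits=tf.matmul(x, w) + b))
    global_step = tf.Variable(0)
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
        loss, global_step=global_step)

# Each client then runs its own training loop against server.target.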
I have two sub-questions related to the phrases "a similar graph" and "a single copy of the compute-intensive part of the model" above.
(A) Why do we say each client builds a similar graph, but not the same graph? It seems to me that the graph built in each client in the example of Replicated training should be the same, because the graph construction code below is shared among all workers:

# Build model...
loss = ...
global_step = tf.Variable(0)
(B) Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs? If not, can we simultaneously use both In-graph replication to support training on multiple GPUs within each machine and Between-graph replication for cross-machine training? I ask this question because @mrry indicated that In-graph replication is essentially the same as the approach used in the CIFAR-10 example model for multiple GPUs.
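For what it's worth, the combination I have in mind looks roughly like the sketch below, i.e. Between-graph replication across machines plus CIFAR-10-style towers over the local GPUs of each worker. I do not know whether the nested device scopes actually compose this way; that is essentially what I am asking. The host names, the number of GPUs and the model are again placeholders.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
task_index = 0   # per-worker flag, as before
num_gpus = 2     # GPUs available on this machine

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    w = tf.Variable(tf.zeros([784, 10]))  # intended to live on /job:ps
    b = tf.Variable(tf.zeros([10]))

    tower_losses = []
    for g in range(num_gpus):
        # One copy of the compute-intensive part per local GPU.
        with tf.device("/job:worker/task:%d/gpu:%d" % (task_index, g)):
            x = tf.placeholder(tf.float32, [None, 784])
            y_ = tf.placeholder(tf.float32, [None, 10])
            tower_losses.append(tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=y_, logits=tf.matmul(x, w) + b)))

    loss = tf.add_n(tower_losses) / num_gpus
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)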