I have gone through this answer, but it doesn't give the rationale for choosing multiple clients in Between-Graph replication to improve performance. How does using Between-Graph replication improve performance compared to In-Graph replication?
-
What is "this" answer? The link does not point anywhere – Nico Haase Feb 20 '18 at 12:18
-
@NicoHaase fixed the link – kanishka Feb 20 '18 at 13:45
-
Could you clarify what was unclear in the answer you linked? I answered before you fixed the link, but really that answer is a lot better and more detailed than mine (it was written by [the guy explaining distributed tensorflow in this video](https://youtu.be/la_M6bCV91M), which I encourage you to watch, by the way). If you clarify what's unclear, I can perhaps focus on those points and improve my answer – GPhilo Feb 20 '18 at 13:49
-
@GPhilo Thanks for taking the time to answer the question. I wanted to understand how exactly performance improves by using Between-Graph replication. Since both replication methods replicate the model on each worker, the abstraction of having more clients improving performance was bugging me. I am a novice in the area of distributed computing, so please bear with my question. – kanishka Feb 20 '18 at 14:53
-
Thanks for the clarification, I'll add a section to my answer to (hopefully) better clarify the differences between the two approaches. – GPhilo Feb 20 '18 at 15:00
1 Answer
In-graph replication works fine for multiple devices on the same machine, but it doesn't scale well to cluster size, because a single client has to coordinate all devices (even those located on different nodes).
Say, for example, that you have two GPUs, one on the client's machine and another on a second machine. Thanks to Tensorflow's magic, a simple `with tf.device('address_of_the_gpu_on_the_other_machine'):` block will place operations on the remote computer's GPU. The graph then runs on both machines, but the data needs to be gathered from both before the computation can proceed (loss computation, etc.). Network communication will slow down your training (and of course, the more machines, the more communication is needed).
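For concreteness, here is a minimal sketch of what that placement looks like, assuming a TF 1.x-style API and a hypothetical cluster with two worker tasks (the device strings and the grpc address are placeholders, not part of the original answer):

```python
import tensorflow as tf  # TF 1.x-style API, as in the rest of the answer

# Placeholder device strings; in a real cluster they come from a tf.train.ClusterSpec.
with tf.device("/job:worker/task:0/device:GPU:0"):  # GPU on the client's machine
    a = tf.random_normal([1000, 1000])
    partial_a = tf.reduce_sum(a)

with tf.device("/job:worker/task:1/device:GPU:0"):  # GPU on the other machine
    b = tf.random_normal([1000, 1000])
    partial_b = tf.reduce_sum(b)

# The addition forces both partial results to be gathered by the single client
# before the computation can proceed; this is the network round-trip described above.
total = partial_a + partial_b

with tf.Session("grpc://worker0.example.com:2222") as sess:  # placeholder address
    print(sess.run(total))
```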
Between-graph replication, on the other hand, scales much better, because each machine has its own client that only needs to coordinate communication with the parameter server and the execution of its own operations. The graphs "overlap" on the parameter server, which holds the one set of variables shared among all the worker graphs. Moreover, communication overhead is greatly reduced, because now you only need fast communication with the parameter servers, and no machine needs to wait for other machines to complete before moving on to the next training iteration.
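As a rough illustration of how the variables end up "overlapping" on the parameter server, here is a sketch using `tf.train.replica_device_setter` (the cluster hosts and the tiny model are assumptions made up for the example):

```python
import tensorflow as tf  # TF 1.x-style sketch

# Hypothetical cluster layout; host names are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

task_index = 0  # each worker process sets its own index

# replica_device_setter pins the variables to the parameter server(s) and keeps
# every other op on this worker, so each client only talks to the PS, not to
# the other workers.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    w = tf.get_variable("w", [784, 10])          # lives on the PS, shared by all workers
    x = tf.placeholder(tf.float32, [None, 784])  # input pipeline is local to this worker
    logits = tf.matmul(x, w)                     # compute stays on this worker's device
```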
How are the graphs different between the two methods?
In-graph replication:
In this method, you have only one graph, managed by the client. This graph has nodes that are spread over multiple devices, even across different machines. This means that, for example, with two machines PC1 and PC2 on a network, the client explicitly dispatches operations to one machine or the other. The graph is technically not "replicated"; only some parts of it are distributed. Typically, the client has a big batch of data that is split into sub-batches, each of which is fed to a compute-intensive part of the graph. Only this compute-intensive part is replicated; everything before the split and after the computation (e.g., loss calculation) runs on the client. This is a bottleneck.
Note, also, that it's the client that decides which operations go to which machine, so theoretically one could have different parts of the graph on different nodes. You could decide to replicate the compute-intensive part identically on all your nodes, or you could, in principle, say "all the convolutions go to PC1, all dense layers go to PC2". Tensorflow's magic will insert data transfers where appropriate to make things work for you.
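To make the "split the big batch, replicate only the compute-intensive towers" idea concrete, here is a hedged sketch (device strings, layer sizes and the `tower` function are illustrative assumptions, not the answer's actual code):

```python
import tensorflow as tf  # TF 1.x-style sketch

def tower(x):
    # Compute-intensive part that gets replicated on every device.
    # Variables are shared between towers via AUTO_REUSE.
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(x, 128, activation=tf.nn.relu)
        return tf.layers.dense(h, 10)

big_batch = tf.placeholder(tf.float32, [None, 784])  # the client's big batch
labels = tf.placeholder(tf.int64, [None])

sub_batches = tf.split(big_batch, 2)  # the client splits it into sub-batches
sub_labels = tf.split(labels, 2)

# Illustrative device strings for the replicas (they could be on remote machines).
devices = ["/job:worker/task:0/device:GPU:0",
           "/job:worker/task:1/device:GPU:0"]

losses = []
for dev, x, y in zip(devices, sub_batches, sub_labels):
    with tf.device(dev):
        logits = tower(x)
        losses.append(tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits))

# Everything before the split and after the towers (loss aggregation, optimizer)
# runs on the client, which is the bottleneck described above.
total_loss = tf.reduce_mean(losses)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)
```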
Between-graph replication:
Here you have multiple similar copies of the same graph. Why similar? Because all of them have the compute-intensive part (as above), but also the input pipeline, the loss calculation and their own optimizer (assuming you're using asynchronous training, the default; synchronous training is another layer of complexity that I'll leave aside). (Delving deeper into Tensorflow's distributed framework, you'll also find that not all workers (and their graphs) are equal: there is one "chief" worker that does initialization, checkpointing and summary logging, but this is not critical to understanding the general idea.)
Unlike above, here you need a special machine, the parameter server (PS), that acts as a central repository for the graph's variables (caveat: not all the variables, only the global ones, like `global_step` and the weights of your network). You need this because, at each iteration of the training step, every worker fetches the most recent values of the variables, sends the PS the updates that must be applied to them, and the PS actually performs the update.
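Putting the pieces together, a per-process script for between-graph replication might look roughly like the sketch below (the hostnames, the toy model and the way `job_name`/`task_index` are obtained are assumptions for illustration):

```python
import tensorflow as tf  # TF 1.x-style sketch

# Hypothetical cluster; every process (PS and workers) builds the same ClusterSpec.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
job_name, task_index = "worker", 0  # in practice each process gets these e.g. from flags

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the PS just serves variable reads/writes
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()  # global variable, lives on the PS
        x = tf.random_normal([32, 784])                      # stand-in for a real input pipeline
        y = tf.layers.dense(x, 10)
        loss = tf.reduce_mean(tf.square(y))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # The chief (task 0) handles initialization, checkpointing and summaries;
    # every worker runs its own independent training loop.
    hooks = [tf.train.StopAtStepHook(last_step=10000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)  # fetch variables from the PS, push updates back
```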
How is this different from the method above?
For one thing, there is no "big batch" that gets split among workers. Every worker processes as much data as it can handle; there is no need for splitting and putting things back together afterwards. This means there is no need to synchronize workers, because the training loops are entirely independent. The training, however, is not independent, because the updates that worker A makes to the variables will be seen by worker B, since they share the same variables. This means that the more workers you have, the faster the training (subject to diminishing returns), because effectively the variables are updated more often (approximately every `time_for_a_train_loop/number_of_workers` seconds). Again, this happens without coordination between workers, which incidentally also makes the training more robust: if a worker dies, the others can continue (with some caveats due to having a chief worker).
One last cool feature of this method is that, in principle, there is no loss in performance when using a heterogeneous cluster. Every machine runs as fast as it can and waits for nobody. If you tried running in-graph replication on a heterogeneous cluster, you'd be limited by the slowest machine (because you collect all results before continuing).
-
@kanishka I added a section with a bit more detail, I hope this will help clarify things a bit ;) – GPhilo Feb 20 '18 at 15:29
-
Thank you for the detailed explanation of the differences. It really helped me get a good intuition about both types of replication. – kanishka Feb 21 '18 at 07:01