Distributed training of a wide and shallow model

Question

I am working on a very wide and shallow computation graph with a relatively small number of shared parameters on a single machine. I would like to make the graph wider but am running out of memory. My understanding is that, with Distributed TensorFlow, it is possible to split the graph between workers by using the tf.device context manager. However, it's not clear how to deal with the loss, which can only be calculated by running the entire graph, or with the training operation.

What would be the right strategy to train the parameters for this kind of model?

score 1 · Accepted Answer · answered Aug 22 '17 at 16:16

TensorFlow is based on the concept of a data-flow graph. You define a graph consisting of variables and ops, and you can place those variables and ops on different servers and/or devices. When you call session.run, you pass data into the graph, and every operation between the inputs (specified in feed_dict) and the outputs (specified in the fetches argument to session.run) runs, regardless of where those ops reside. Of course, passing data across servers incurs communication overhead, but that overhead is often outweighed by having multiple workers performing computation concurrently.

In short, even if you put ops on other servers, you can still compute the loss over the full graph.
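As a rough illustration, here is a minimal sketch of that idea. It assumes the TF 1.x graph API (reached through tf.compat.v1 so it also runs under TF 2.x), and the device list, branch count, toy data, and function names are all hypothetical. Both entries in DEVICES point at the local CPU so the sketch runs on one machine; in a real cluster they would be strings such as "/job:worker/task:0", and the session would connect to a server target instead of running locally.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x graph-mode API

# Hypothetical placement targets. On a cluster these would look like
# "/job:worker/task:0", "/job:worker/task:1", and so on.
DEVICES = ["/cpu:0", "/cpu:0"]
NUM_BRANCHES = 4  # "width" of the toy graph
DIM = 3


def build_graph():
    graph = tf.Graph()
    with graph.as_default():
        x = tf1.placeholder(tf.float32, shape=[None, DIM], name="x")
        y = tf1.placeholder(tf.float32, shape=[None, 1], name="y")

        # The small set of shared parameters lives on one device.
        with tf.device(DEVICES[0]):
            w = tf1.get_variable(
                "w", shape=[DIM, 1], initializer=tf1.zeros_initializer())

        # Each branch of the wide graph is pinned to some device.
        branch_outputs = []
        for i in range(NUM_BRANCHES):
            with tf.device(DEVICES[i % len(DEVICES)]):
                branch_outputs.append(tf.matmul(x * float(i + 1), w))

        # The loss and train op are defined once, over all branches.
        with tf.device(DEVICES[0]):
            pred = tf.add_n(branch_outputs) / NUM_BRANCHES
            loss = tf.reduce_mean(tf.square(pred - y))
            train_op = tf1.train.GradientDescentOptimizer(0.02).minimize(loss)

        init = tf1.global_variables_initializer()
    return graph, x, y, loss, train_op, init


def train(steps=50):
    graph, x, y, loss, train_op, init = build_graph()
    rng = np.random.RandomState(0)
    data_x = rng.randn(32, DIM).astype(np.float32)
    data_y = data_x.dot(np.ones((DIM, 1), dtype=np.float32))
    feed = {x: data_x, y: data_y}

    # Soft placement lets TensorFlow fall back if a device is unavailable.
    config = tf1.ConfigProto(allow_soft_placement=True)
    with tf1.Session(graph=graph, config=config) as sess:
        sess.run(init)
        first_loss = sess.run(loss, feed_dict=feed)
        for _ in range(steps):
            # One call runs every op the train op depends on,
            # wherever those ops were placed.
            sess.run(train_op, feed_dict=feed)
        last_loss = sess.run(loss, feed_dict=feed)
    return first_loss, last_loss
```

Calling train() returns the loss before and after training. The point is that loss and train_op are ordinary graph nodes: fetching them pulls in every branch, and the gradients flow back to the shared variable without any extra plumbing in user code.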

Here's a tutorial on large scale linear models: https://www.tensorflow.org/tutorials/linear

And here's a tutorial on distributed training in TensorFlow: https://www.tensorflow.org/deploy/distributed
