
I have a use case where I start multiple nodes and want only one node (let's call it the master node) to be able to create the train_op. Once the train_op starts on this master node, the other nodes (the slaves) should be able to join with the graph passed to them (without building it themselves). Essentially, the slave nodes should join the master once it has created the training op and is ready for the training loop; the rest of the time the slaves should just poll the master node.

The only way I can see to do this right now is to have the master broadcast (over HTTP or RPC) when it creates the op, and also broadcast the model in some JSON format to the slave nodes; the slaves then use this JSON data to build the graph and training op themselves and join the distributed training as worker and ps nodes. I haven't used distributed training much, so I don't know the correct way to go about this. Are there any TensorFlow APIs that would make this easier?
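
For context, here is a rough sketch (TF 1.x style, with a toy graph standing in for the real model) of what I imagine the broadcast could look like if the graph were shipped as a serialized MetaGraphDef rather than a hand-rolled JSON format:

```python
import tensorflow as tf
from tensorflow.core.protobuf import meta_graph_pb2

# --- master: build the graph, then serialize it for broadcast instead of JSON ---
x = tf.placeholder(tf.float32, name="x")                      # stand-in for the real model
y = tf.square(x, name="y")
payload = tf.train.export_meta_graph().SerializeToString()    # bytes to ship over HTTP/RPC

# --- slave: parse the received bytes and import the graph without any model code ---
received = meta_graph_pb2.MetaGraphDef()
received.ParseFromString(payload)
with tf.Graph().as_default() as g:
    tf.train.import_meta_graph(received)
    y_restored = g.get_tensor_by_name("y:0")                  # look ops up by name, no rebuilding
```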

EDIT: I think I didn't state the main question explicitly. Considering this TensorFlow example, specifically the following lines -

# Build model...
loss = ...
global_step = tf.contrib.framework.get_or_create_global_step()

How do I pass the loss itself, in some format, to the worker nodes rather than explicitly constructing the whole graph there?
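
To illustrate what I mean by "pass the loss itself": with the MetaGraphDef sketch above, the master could put the loss (and the train op) into named collections before exporting, and a worker could look them up after importing instead of rebuilding the model. This is only a sketch; the toy model and the collection names ("my_loss", "my_train_op") are placeholders.

```python
import tensorflow as tf

# --- master: build the model, then tag the tensors of interest before exporting ---
w = tf.get_variable("w", shape=[10])
loss = tf.reduce_mean(tf.square(w))                      # stand-in for the real model's loss
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
tf.add_to_collection("my_loss", loss)                    # collection names are arbitrary placeholders
tf.add_to_collection("my_train_op", train_op)
meta_graph = tf.train.export_meta_graph()                # collections travel inside the MetaGraphDef

# --- worker: import the graph and look the tensors up instead of rebuilding them ---
with tf.Graph().as_default():
    tf.train.import_meta_graph(meta_graph)
    worker_loss = tf.get_collection("my_loss")[0]
    worker_train_op = tf.get_collection("my_train_op")[0]
```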

Abhishek Singh
  • I'm currently learning about distributed TF myself but I'm trying to do something similar to what (I think) you want. You can have worker and PS servers running and then create graphs and sessions in one master server that are targeted to those remote servers. My current approach is one script with multiple threads, each of which creates a graph and starts a session for each (remote) worker, but in principle you can also create one single graph and start multiple remote sessions on it (see end of point 1. [here](https://stackoverflow.com/a/41601168/1782792)). – jdehesa Jul 03 '18 at 09:29
  • Thank you, that link is very useful. I wish tensorflow had more documentation or examples for distributed training. – Abhishek Singh Jul 03 '18 at 16:35
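
A minimal sketch of the approach described in the comment above: the worker and ps nodes just run tf.train.Server and wait, while a single master script builds the graph and opens sessions targeted at the remote servers over gRPC. The hostnames, ports, and cluster layout here are made up for illustration.

```python
import tensorflow as tf

# Made-up cluster layout; in practice this comes from your own configuration.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# --- each worker/ps process only starts a server and blocks, e.g.: ---
# server = tf.train.Server(cluster, job_name="worker", task_index=0)
# server.join()

# --- master script: build one graph, then run it on a remote worker over gRPC ---
with tf.Graph().as_default():
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # build model, loss, train_op here...
        global_step = tf.train.get_or_create_global_step()
    with tf.Session("grpc://worker0.example.com:2222") as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(global_step))
```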

0 Answers