I have a use case where I start multiple nodes and want only one node (let's call it the master node) to be able to create the train_op. Once the train_op starts on this master node, the remaining nodes (let's call them slave nodes) should be able to join with the graph passed to them, without building it themselves. Essentially, the slave nodes should join the master node once the master creates the training op and is ready for the training loop; the rest of the time the slaves should just be polling the master node.
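For concreteness, the cluster layout I have in mind is roughly this (a minimal sketch; the host names, ports, and job split are placeholders, not my actual deployment):

    import tensorflow as tf

    # Hypothetical cluster layout: one master, two slaves acting as workers,
    # and one parameter server. Host names and ports are placeholders.
    cluster = tf.train.ClusterSpec({
        "master": ["master-host:2222"],
        "worker": ["worker0-host:2223", "worker1-host:2224"],
        "ps": ["ps0-host:2225"],
    })

    # Each process starts an in-process server for its own task; a ps task
    # can simply block and serve whatever graph the other tasks place on it.
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    server.join()

The ps task above can already just call server.join() and wait; what I want is the same join-and-wait behaviour for the slave/worker tasks, without each of them having to construct the graph first.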
The only way I can do this right now is by having the master broadcast (over HTTP or RPC) when it creates the op, and also broadcast the model in some JSON format to the slave nodes; the slaves then use this JSON data to build the graph and training op themselves and join the distributed training as worker and ps nodes (a stripped-down sketch of this workaround is below). I haven't used distributed training much, so I don't know the correct way to go about it. Are there any TensorFlow APIs that would make this easier?
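Roughly, the slave side of that workaround looks like this (the JSON schema and the build_from_spec helper are my own ad-hoc code, not TensorFlow APIs; the HTTP/RPC broadcast itself is omitted):

    import json
    import tensorflow as tf

    # The master broadcasts an ad-hoc JSON description of the model, e.g.:
    model_json = json.dumps({"layers": [128, 10], "learning_rate": 0.01})

    # ...and every slave has to rebuild the identical graph from it before it
    # can participate in training.
    def build_from_spec(spec):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.int64, [None])
        h = x
        for units in spec["layers"]:
            h = tf.layers.dense(h, units)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=h)
        train_op = tf.train.GradientDescentOptimizer(
            spec["learning_rate"]).minimize(loss)
        return loss, train_op

    loss, train_op = build_from_spec(json.loads(model_json))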
EDIT: I think I didn't state the main question explicitly. Considering this TensorFlow example, specifically the following lines -
# Build model...
loss = ...
global_step = tf.contrib.framework.get_or_create_global_step()
How do I pass the loss itself in some serialized format to the worker nodes, rather than having each worker explicitly construct the whole graph?
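To illustrate what I mean by passing the loss "in some serialized format", this is roughly the flow I am hoping for (a hedged sketch built around MetaGraphDef serialization; I don't know whether this is the intended mechanism for distributed training, which is exactly what I'm asking):

    import tensorflow as tf

    # --- on the master, after the graph and loss exist ---
    loss = tf.constant(0.0, name="loss")       # stand-in for the real loss
    tf.add_to_collection("losses", loss)
    meta_graph = tf.train.export_meta_graph()  # serializable MetaGraphDef proto

    # --- on a slave, after receiving the serialized graph somehow ---
    with tf.Graph().as_default():
        tf.train.import_meta_graph(meta_graph)
        imported_loss = tf.get_collection("losses")[0]

Is something along these lines the right way for the slaves to pick up the master's graph, or is rebuilding the graph on every node simply the expected pattern?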