I'm having some trouble with the new TensorFlow feature that lets us run distributed TensorFlow.
I would just like to run two tf.constant ops on two tasks, but my code never terminates. It looks like this:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
    with tf.device("/job:local/replica:0/task:1"):
        const2 = tf.constant("Hello I am the second constant")
    print sess.run([const1, const2])
And I have the following code that works (with just one task, on localhost:2222):
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
        const2 = tf.constant("Hello I am the second constant")
    print sess.run([const1, const2])
Output: ['Hello I am the first constant', 'Hello I am the second constant']
Maybe I'm misunderstanding these functions... If you have any idea, please let me know.
Thank you ;).
EDIT
OK, I found that it's not possible to run this the way I did from an IPython notebook; I need a standalone Python program executed from a terminal. But now I have a new issue: when I run my code, the server tries to connect to both of the ports I passed in, even though I tell it to run on only one of them. My new code looks like this:
import tensorflow as tf

tf.app.flags.DEFINE_string('job_name', '', 'Name of the job, e.g. "local"')
tf.app.flags.DEFINE_string('local', '', 'Comma-separated list of hostname:port pairs for the "local" job')
tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of the local replica running the training')
tf.app.flags.DEFINE_integer('constant_id', 0, 'The constant we want to run')
FLAGS = tf.app.flags.FLAGS

local_hosts = FLAGS.local.split(',')
cluster = tf.train.ClusterSpec({"local": local_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)

with tf.Session(server.target) as sess:
    if FLAGS.constant_id == 0:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const1 = tf.constant("Hello I am the first constant")
        print sess.run(const1)
    if FLAGS.constant_id == 1:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const2 = tf.constant("Hello I am the second constant")
        print sess.run(const2)
I run the following command line:
python test_distributed_tensorflow.py --local=localhost:3000,localhost:3001 --job_name=local --task_id=0 --constant_id=0
and I get the following logs:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job local -> {localhost:3000, localhost:3001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:3000
E0518 15:27:11.794873779 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
E0518 15:27:12.795184395 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
...
EDIT 2
I found the solution: every task that is declared in the ClusterSpec must actually be started as a server. So I have to run this:
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=0 --constant_id=0 \
& \
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=1 --constant_id=1
I hope this can help someone ;)
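
For completeness, here is a minimal single-process variation (my own sketch, not from the original setup; the port numbers are arbitrary). Since a tf.train.Server starts serving on a background thread as soon as it is created, you can create both tasks' servers in one script and run both constants from a single session, without launching two processes:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2345", "localhost:2346"]})

# Each in-process server starts listening on its port as soon as it is
# created, so both addresses in the ClusterSpec are served by this one
# process and no "connection refused" retries occur.
server0 = tf.train.Server(cluster, job_name="local", task_index=0)
server1 = tf.train.Server(cluster, job_name="local", task_index=1)

# Pin one constant to each task.
with tf.device("/job:local/task:0"):
    const1 = tf.constant("Hello I am the first constant")
with tf.device("/job:local/task:1"):
    const2 = tf.constant("Hello I am the second constant")

with tf.Session(server0.target) as sess:
    print sess.run([const1, const2])

In the two-process setup above, a task that only hosts ops and never runs a session of its own can simply call server.join() after creating its server, which blocks forever and keeps the task serving.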