I'm having some trouble with the new feature of TensorFlow that lets us run distributed TensorFlow.

I would just like to run two tf.constant ops with two tasks, but my code never terminates. It looks like this:

import tensorflow as tf

# One job called "local" with two tasks
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    # Pin each constant to a different task of the "local" job
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
    with tf.device("/job:local/replica:0/task:1"):
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))

And I have the following code that works (with just one localhost:2222):

import tensorflow as tf

# Single-task cluster: everything runs on one server
cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))

Output: ['Hello I am the first constant', 'Hello I am the second constant']

Maybe I don't understand these functions correctly, so if you have an idea, please let me know.

Thank you ;).

EDIT

OK, I found that it's not possible to run this the way I did from an IPython notebook; I need to write a Python program and execute it from a terminal. But now I have a new issue when I run my code: the server tries to connect to both of the given ports, even though I tell it to run on only one. My new code looks like this:

import tensorflow as tf

tf.app.flags.DEFINE_string('job_name', '', 'Name of the job this task belongs to (here: local)')
tf.app.flags.DEFINE_string('local', '', 'Comma-separated list of hostname:port pairs for the local job')

tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of the local replica running the training')
tf.app.flags.DEFINE_integer('constant_id', 0, 'The constant we want to run')

FLAGS = tf.app.flags.FLAGS

local_hosts = FLAGS.local.split(',')

cluster = tf.train.ClusterSpec({"local": local_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)

with tf.Session(server.target) as sess:
    # Pin the chosen constant to this process's own task
    if FLAGS.constant_id == 0:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const1 = tf.constant("Hello I am the first constant")
            print(sess.run(const1))
    elif FLAGS.constant_id == 1:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const2 = tf.constant("Hello I am the second constant")
            print(sess.run(const2))

I run the following command:

python test_distributed_tensorflow.py --local=localhost:3000,localhost:3001 --job_name=local --task_id=0 --constant_id=0

and I get the following logs:

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job local -> {localhost:3000, localhost:3001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:3000
E0518 15:27:11.794873779   10884 tcp_client_posix.c:173]     failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
E0518 15:27:12.795184395   10884 tcp_client_posix.c:173]     failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
...

EDIT 2

I found the solution: every task that we declare to the Server via the ClusterSpec must actually be running, otherwise the session blocks while trying to connect to the missing tasks. So I have to run this:

python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=0 --constant_id=0 \
& \
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=1 --constant_id=1
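
Equivalently, both tasks can be spawned from one small Python launcher instead of a shell one-liner. A minimal sketch, assuming the script name and flags from the command above:

import subprocess

# One process per task in the ClusterSpec: the session only starts once
# every listed task is up and reachable.
hosts = "localhost:2345,localhost:2346"
procs = [
    subprocess.Popen([
        "python", "test_distributed_tensorflow.py",
        "--local=" + hosts,
        "--job_name=local",
        "--task_id=" + str(i),
        "--constant_id=" + str(i),
    ])
    for i in range(2)
]
for p in procs:
    p.wait()  # block until both tasks have finished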

I hope this helps someone ;)

1 Answer

The latest version of TensorFlow provides distribution strategies for working across multiple systems.

Distribution strategies are explained there with an example. Take a look at this link.
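
For reference, a minimal sketch of that API, assuming TensorFlow 2.x and tf.distribute.MultiWorkerMirroredStrategy (the cluster addresses and the tiny Keras model are illustrative, not from the question):

import json
import os

import tensorflow as tf

# TF_CONFIG plays the role of the old ClusterSpec: it lists every task in
# the cluster and says which one this process is. It must be set before the
# strategy is created, and, as in the question, every listed worker must
# actually be started.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:2222", "localhost:2223"]},
    "task": {"type": "worker", "index": 0},  # use index 1 in the second process
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created under the strategy's scope are replicated across workers.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="sgd", loss="mse")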