I have access to a computer cluster with the SLURM manager. I want different nodes to execute different parts of my code. If I understood correctly, this can be achieved through SLURM with the srun command if the code is written appropriately, something like the MPI example here: https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html .

But I don't understand how to write such code in TF 2; most of the available information is for TF 1. If I try something like this

import tensorflow as tf

jobs = {'worker': 4}
cluster = tf.distribute.cluster_resolver.SlurmClusterResolver(jobs=jobs)
server0 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=0)
server1 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=1)
server2 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=2)
server3 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=3)

and run it using SLURM, I get an error showing that only the first server started, while the second tried to use the same address, 'localhost:8888'. So essentially, I do not know how to create servers on different nodes that can later communicate. Should I run different scripts simultaneously? Do I have to use command-line flags or something like that?

Afterwards, my idea is to use

with tf.device("/job:worker/task:0"):
    # some code
with tf.device("/job:worker/task:1"):
    # some other code

to distribute the work. Any help? I don't think I can use any of the distribution strategies TF has to offer.


1 Answer

I seem to have found a solution, so I am posting it in case it helps someone. It turns out that

cluster = tf.compat.v1.train.ClusterSpec({'worker': ['n03:2222', 'n04:2223']})

instead of the cluster resolver solves the problem of the clashing addresses. Later I needed to open a session, and it had to be one whose target is the server for task 1 (I am not sure why; probably it has something to do with which node is the master), like this:

with tf.compat.v1.Session(server1.target) as sess:
    x = tf.Variable(...)
    for k in range(n):
        y1 = f1(x)    # work placed on /job:worker/task:0
        y2 = f2(x)    # work placed on /job:worker/task:1
        y1 = y1.eval()
        y2 = y2.eval()

where f1(x) is a tf.function in which the work is pinned to a specific worker, for example:

@tf.function
def f1(x):
    with tf.device("/job:worker/task:0"):
        y = ...
        x.assign(x + 1)
    return y
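
and f2(x) is similar, just pinned to task 1; its body is not shown here, so the sketch below is hypothetical, mirroring the shape of f1, and only illustrates the device placement:

@tf.function
def f2(x):
    with tf.device("/job:worker/task:1"):
        # same kind of computation as in f1, but placed on the second worker
        y = ...
    return y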

This was all in one script, which I call from a .sh file.
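
One thing the snippets above leave implicit is where server1 comes from. Below is a minimal sketch of how the two servers could be brought up from the same ClusterSpec, with one tf.distribute.Server per entry in the 'worker' list; the server0/server1 names simply mirror the question's snippet, and getting the two processes onto n03 and n04 is still up to the SLURM submission, so treat this only as an illustration:

import tensorflow as tf

# One explicit host:port per worker, so the servers do not all fall back to
# the same default address (the 'localhost:8888' clash from the question).
cluster = tf.compat.v1.train.ClusterSpec({'worker': ['n03:2222', 'n04:2223']})

# One in-process server per task; the session shown earlier connects to
# server1.target.
server0 = tf.distribute.Server(cluster, job_name='worker', task_index=0)
server1 = tf.distribute.Server(cluster, job_name='worker', task_index=1)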
