I have access to a compute cluster managed by SLURM, and I want different nodes to execute different parts of my code. If I understood correctly, this can be achieved with the srun command, provided the code is written appropriately. It should work like the MPI example here: https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
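By analogy with that tutorial, I assume the submission script would look roughly like this (the script name is a placeholder):

#!/bin/bash
#SBATCH --job-name=tf-distributed
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1

# like mpirun in the MPI example, srun starts one copy of the program per task
srun python my_script.py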
But I don't understand how to write the equivalent code in TF; most of the information I can find is for TF 1. If I try something like this:
import tensorflow as tf

jobs = {'worker': 4}
cluster = tf.distribute.cluster_resolver.SlurmClusterResolver(jobs=jobs)
server0 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=0)
server1 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=1)
server2 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=2)
server3 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=3)
and run it under SLURM, I get an error: only the first server starts, and the second one then tries to bind the same address, 'localhost:8888'. So essentially, I do not know how to start servers on different nodes so that they can later communicate with each other. Should I run separate scripts simultaneously, one per node? Do I have to pass command-line flags or something like that?
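My current guess is that every SLURM task should run the same script and start a single server for its own index, something like the sketch below (task_type and task_id are properties I found on the resolver, so treat this as untested):

import tensorflow as tf

# each of the 4 SLURM tasks runs this same script; the resolver reads
# the SLURM environment to figure out which task this process is
resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(jobs={'worker': 4})
cluster_spec = resolver.cluster_spec()

# each process starts exactly one server, for its own task index,
# instead of one process trying to start all four
server = tf.distribute.Server(cluster_spec,
                              job_name=resolver.task_type,
                              task_index=resolver.task_id)

Is that the intended pattern?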
Afterwards, my idea is to use
with tf.device("/job:worker/task:0"):
    # some code
with tf.device("/job:worker/task:1"):
    # some other code
to distribute the work. I don't think I can use any of the distribution strategies TF has to offer, since those seem to be aimed at running the same computation on every worker, whereas I want each worker to run a different part.
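In case it clarifies what I am after, here is my (untested) guess at the client side once the servers are up, based on tf.config.experimental_connect_to_cluster from the TF 2 docs; I am not sure it is the right mechanism:

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(jobs={'worker': 4})
tf.config.experimental_connect_to_cluster(resolver)

with tf.device("/job:worker/task:0"):
    x = tf.random.uniform((1000, 1000))
    part0 = tf.matmul(x, x)  # should execute on worker 0

with tf.device("/job:worker/task:1"):
    y = tf.random.uniform((1000, 1000))
    part1 = tf.matmul(y, y)  # should execute on worker 1

Any help would be appreciated.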