
I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get TensorFlow to run across cluster nodes, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has only run it on CPU nodes or on a single GPU node.)

I found an excellent bit of documentation in this previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.

When I try to run the code from the SO question, it appears to work perfectly until it either fails to connect to a nonexistent parameter server or hangs when server.join() is called, and no printouts are written to the sbatch outfile (which I understand should happen).

So in short, my question is: how would one go about starting TensorFlow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure TensorFlow. For example, how do I specify which of the hosts in tf_hostlist serves as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
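To make the question concrete, here is a sketch of the role assignment I imagine, assuming one process per node launched via srun. The function name assign_roles and the convention of using the first host(s) as parameter servers are my own guesses, not anything from the documentation; the idea is to derive each task's job from the SLURM rank (SLURM_PROCID) and the expanded hostlist:

```python
def assign_roles(hostlist, proc_id, port=2222, n_ps=1):
    """Split an expanded SLURM hostlist into parameter-server and worker jobs.

    hostlist: expanded node names (e.g. from `scontrol show hostnames`).
    proc_id:  this task's rank, e.g. int(os.environ["SLURM_PROCID"]).
    Convention (my assumption): the first n_ps hosts act as parameter
    servers; the remaining hosts are workers.
    """
    hosts = ["%s:%d" % (h, port) for h in hostlist]
    cluster = {"ps": hosts[:n_ps], "worker": hosts[n_ps:]}
    if proc_id < n_ps:
        job_name, task_index = "ps", proc_id
    else:
        job_name, task_index = "worker", proc_id - n_ps
    return cluster, job_name, task_index

# With TensorFlow's distributed (tf.train) API, I would then expect
# something like:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name=job_name, task_index=task_index)
#   if job_name == "ps":
#       server.join()  # parameter servers block here forever
```

Is this roughly the right approach, or is there a more idiomatic way to map SLURM tasks onto TensorFlow jobs?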

Skylion

0 Answers