
I am in the fortunate position of having access to my university's SLURM-powered GPU cluster. I have been trying to get TensorFlow to run across cluster nodes, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has only run it on CPU nodes or on a single GPU node.)

I found an excellent bit of documentation in this previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found, such as this one, rely on explicitly specifying the parameter server.

When I try to run the code from the SO question, it appears to work perfectly until it either fails to connect to a nonexistent parameter server or hangs when server.join() is called, and no printouts are written to the sbatch outfile (which I understand should happen).

So in short, my question is: how would one go about starting TensorFlow on a SLURM cluster, from the sbatch stage onwards? This is my first time dealing with a distributed computing framework besides Spark on AWS, and I would love to learn more about how to properly configure TensorFlow. For example, how do I specify which of the hosts in tf_hostlist serves as the parameter server? Alternatively, can I use sbatch to send slightly different commands to each worker, as I have seen done in other examples?
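To make the question concrete, here is a sketch of the role assignment I imagine, assuming one process per node launched via srun. The function name assign_roles and the convention of using the first host(s) as parameter servers are my own guesses, not anything from the documentation; the idea is to derive each task's job from the SLURM rank (SLURM_PROCID) and the expanded hostlist:

```python
def assign_roles(hostlist, proc_id, port=2222, n_ps=1):
    """Split an expanded SLURM hostlist into parameter-server and worker jobs.

    hostlist: expanded node names (e.g. from `scontrol show hostnames`).
    proc_id:  this task's rank, e.g. int(os.environ["SLURM_PROCID"]).
    Convention (my assumption): the first n_ps hosts act as parameter
    servers; the remaining hosts are workers.
    """
    hosts = ["%s:%d" % (h, port) for h in hostlist]
    cluster = {"ps": hosts[:n_ps], "worker": hosts[n_ps:]}
    if proc_id < n_ps:
        job_name, task_index = "ps", proc_id
    else:
        job_name, task_index = "worker", proc_id - n_ps
    return cluster, job_name, task_index

# With TensorFlow's distributed (tf.train) API, I would then expect
# something like:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name=job_name, task_index=task_index)
#   if job_name == "ps":
#       server.join()  # parameter servers block here forever
```

Is this roughly the right approach, or is there a more idiomatic way to map SLURM tasks onto TensorFlow jobs?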

Skylion

0 Answers