
I am very new to OpenMPI/parallel computing in general. I have been following this guide on how to set it up, and I got through nearly the whole guide without a problem; it's this final test that has been giving me trouble, and I haven't quite worked out what's going wrong. When I submit the "compute-pi" sbatch job, I am greeted with this error:

[Screenshot: Slurm output file from the failed submission]

However, when I set n=4 (i.e., have it run on one node), compute-pi runs just fine. Furthermore, I can also run srun commands without issue across all nodes, so the master node can schedule jobs across the cluster without a problem. I think things start to break down once the nodes have to communicate with each other.
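For reference, a minimal sbatch script for this kind of job looks roughly like the following. This is only a sketch: the task counts, output file name, and binary name are assumptions on my part, not the exact script from the guide.

    #!/bin/bash
    #SBATCH --job-name=compute-pi
    #SBATCH --output=slurm.out
    #SBATCH --ntasks=8            # spans more than one node -> fails
    #SBATCH --ntasks-per-node=4   # with --ntasks=4 the job stays on one node and works

    # mpirun picks up the Slurm allocation and launches one rank per task
    mpirun ./compute-pi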

I would expect the slurm.out file to contain a calculated approximation of pi, but the job never completes because of this issue. Furthermore, I tried adding "--mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0" to the command, which I saw suggested on other forums, but that didn't help. I also opened up iptables to allow traffic from all IPs, but again that did not work.
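To be concrete, the MCA flags were added directly to the mpirun line, roughly like this. The flags are the ones quoted above; the binary name and the exact iptables commands are assumptions for illustration.

    # restrict OpenMPI's out-of-band and TCP BTL traffic to eth0
    mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ./compute-pi

    # "opening up iptables" amounted to roughly this on every node
    sudo iptables -P INPUT ACCEPT
    sudo iptables -F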

Any suggestions are greatly appreciated. If you need any more information, let me know and I will be happy to oblige.

1 Answer


Do you run Docker on those nodes as well? Can you run ip addr and share the result? (Sorry to ask these questions in an answer, but I can't add comments because my reputation score isn't high enough.)

I'm asking because I've seen others hit the same issue when Docker is running on the nodes: https://forums.developer.nvidia.com/t/open-mpi-network-setup/211658
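If Docker is installed, its docker0 bridge will show up in ip addr and can confuse OpenMPI's interface selection when ranks on different nodes try to connect. A minimal sketch of what to check and try, assuming the bridge is named docker0 and the binary is ./compute-pi (both assumptions):

    # list interfaces on every node; look for docker0 or other virtual bridges
    ip addr

    # if a Docker bridge is present, try excluding it (and loopback) explicitly
    mpirun --mca btl_tcp_if_exclude docker0,lo ./compute-pi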
