I am very new to Open MPI and parallel computing in general. I have been following this guide on how to set it up, and I got through almost all of it without a problem. The final test is the one giving me trouble, and I haven’t worked out what is going wrong. When I submit the “compute-pi” sbatch job, I am greeted with this error.
Slurm Output File when submitting
However, when I set n=4 (i.e. have it run on one node), compute-pi runs just fine. I can also run srun commands across all nodes without issue, so the master node can schedule jobs across the cluster without a problem. My guess is that things start to break down once the nodes have to communicate with each other.
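One way to test that guess outside of MPI is to check plain TCP reachability from one node to another. The following is only a minimal sketch: the function name `can_connect` is made up for illustration, and any hostname or port passed to it is an assumption about the cluster, not something taken from the guide. Open MPI normally picks its TCP ports dynamically, so a successful connection on one port does not prove that MPI traffic is allowed; a failure, however, would point at basic networking rather than at MPI.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Any socket-level failure (refused connection, timeout, unresolvable
    hostname) is reported as False instead of raising.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `can_connect("node02", 22)` from the master node would check whether the (hypothetical) host `node02` accepts connections on the SSH port.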
I would expect the slurm.out file to contain a calculated approximation of pi, but the job never completes because of this error. I tried adding "--mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0" to the command, as suggested on other forums, but that didn't help. I also opened up iptables to allow traffic from all IPs, but that did not work either.
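The two `--mca ..._if_include eth0` options can only help if every node actually has an interface named `eth0`. A minimal sketch for checking that is shown here; the helper names `interface_names` and `has_interface` are made up for illustration, and the script would need to be run on each node (for instance through srun, which works on this cluster).

```python
import socket


def interface_names():
    """Return the names of the network interfaces known to this host."""
    # socket.if_nameindex() yields (index, name) pairs; keep only the names.
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name):
    """Return True if an interface with the given name exists on this host."""
    return name in interface_names()


# Print what this node reports, and whether the name used in the flags exists.
print(socket.gethostname(), interface_names())
print("eth0 present:", has_interface("eth0"))
```

If a node reports a different name for its cluster-facing interface, the value given to the two `if_include` options would presumably have to match that name instead.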
Any suggestions are greatly appreciated. If you need any more information, let me know and I will be happy to provide it.