I am new to using Microsoft Azure for scientific computing purposes and have encountered a few issues whilst setting up.
I have a jump box set up that acts as a license server for the software I wish to use; it also has a common drive that stores all of the software. Six compute nodes are also set up (16 cores per node), and I can 'ssh' from the jump box to the compute nodes without issue. The jump box and compute nodes run CentOS with OpenMPI 1.10.3.
I have created a script, stored on the mounted jump box drive, that I run on each compute node through 'clusRun.sh'. The script sets up all the environment variables specific to the software I run and to OpenMPI. Hopefully it all sounds good to this point.
I've used this software on Linux clusters a lot in the past without issue. Jobs are submitted using a command such as:
mpirun -np XXX -hostfile XXX {path to software}
where the first XXX is the number of processors and the second XXX is the path to the hostfile.
I run this command on the jump box. The hostfile lists the name of each compute node, and each name appears as many times as the number of cores I want to use on that node. Hope that makes sense! No processes from the job run on the jump box itself; it is merely used to launch the job.
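To make the hostfile layout concrete, here is a minimal sketch of how such a file could be generated. The node names (node1 to node6), the file name hostfile.txt and the helper name make_hostfile are placeholders for illustration only, not my real names.

```shell
#!/bin/sh
# Sketch: write one hostfile line per core wanted on each node.
# make_hostfile, the node names and hostfile.txt are hypothetical placeholders.
make_hostfile() {
    cores_per_node=$1
    shift
    for node in "$@"; do
        i=0
        while [ "$i" -lt "$cores_per_node" ]; do
            echo "$node"
            i=$((i + 1))
        done
    done
}

# 6 nodes x 16 cores each gives a 96-line hostfile.
make_hostfile 16 node1 node2 node3 node4 node5 node6 > hostfile.txt
```

With a file like this, the launch line from the jump box would take the form: mpirun -np 96 -hostfile hostfile.txt {path to software}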
When I try to run jobs this way, I receive a number of errors, most of which seem to be tied to Infiniband. Here is a short list of the key errors:
"The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out"
"The OpenFabrics (openib) BTL failed to initialize while trying to create an internal queue"
"OMPI source: btl_openib.c:324
Function: ibv_create_srq()
Error: Function not implemented (errno=38)
Device: mlx4_0"
"At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes"
Are there any environment variables specific to OpenMPI that need to be set to define Infiniband settings? I have already defined the usual MPI_BIN, LD_LIBRARY_PATH, PATH etc. I know that IntelMPI requires additional variables.
The Infiniband should come as part of the A9 HPC allocation; however, I'm not sure if it needs any specific setting up. When I run 'ifconfig -a' there are no Infiniband-specific entries (I expected to see ib0, ib1 etc.). I just have eth0, eth1 and lo.
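For anyone who wants to repeat the interface check on their own node, the sketch below does roughly what I did by eye with 'ifconfig -a'. It assumes the standard Linux sysfs directory /sys/class/net is present; the helper name has_ib_interface is made up for illustration.

```shell
#!/bin/sh
# Sketch: succeed if any given interface name looks like ib0, ib1, ...
# has_ib_interface is a hypothetical helper name.
has_ib_interface() {
    for name in "$@"; do
        case "$name" in
            ib[0-9]*) return 0 ;;
        esac
    done
    return 1
}

# Read interface names from sysfs (assumed present on a Linux node).
ifaces=$(ls /sys/class/net 2>/dev/null || true)

# $ifaces is left unquoted on purpose so each name becomes one argument.
if has_ib_interface $ifaces; then
    echo "ib* interface found"
else
    echo "no ib* interface found"
fi
```

On my compute nodes this would report that no ib* interface was found, since only eth0, eth1 and lo are listed.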
I look forward to any advice that someone might be able to offer.
Kind regards!