
I have a problem when trying to use Slurm sbatch or srun jobs with MPI over InfiniBand.

Open MPI is installed, and if I launch the following test program (called hello) with mpirun -n 30 ./hello, it works.

// compilation: mpicc -o hello hello.c
#include <mpi.h>
#include <stdio.h>
int main ( int argc, char * argv [] )
{
   int myrank, nproc;
   MPI_Init ( &argc, &argv );
   MPI_Comm_size ( MPI_COMM_WORLD, &nproc );
   MPI_Comm_rank ( MPI_COMM_WORLD, &myrank );
   printf ( "hello from rank %d of %d\n", myrank, nproc );
   MPI_Barrier ( MPI_COMM_WORLD );
   MPI_Finalize (); 
   return 0;
}

So:

user@master:~/hello$ mpicc -o hello hello.c
user@master:~/hello$ mpirun -n 30 ./hello
--------------------------------------------------------------------------
[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: master

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
hello from rank 25 of 30
hello from rank 1 of 30
hello from rank 6 of 30
[...]
hello from rank 17 of 30

When I try to launch it through Slurm, I get segmentation faults like this:

user@master:~/hello$ srun -n 20 ./hello
[node05:01937] *** Process received signal ***
[node05:01937] Signal: Segmentation fault (11)
[node05:01937] Signal code: Address not mapped (1)
[node05:01937] Failing at address: 0x30
[node05:01937] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcf6bf7ecb0]
[node05:01937] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7fcf679b64c6]
[node05:01937] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7fcf679b74cb]
[node05:01937] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7fcf679b2141]
[node05:01937] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7fcf679a2ad0]
[node05:01937] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7fcf6c209b34]
[node05:01937] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fcf67bca652]
[node05:01937] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7fcf6c209359]
[node05:01937] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7fcf65d1b975]
[node05:01937] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7fcf6c21a0bc]
[node05:01937] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7fcf6c1cb89d]
[node05:01937] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7fcf6c1eb56b]
[node05:01937] [12] /home/user/hello/./hello[0x400826]
[node05:01937] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcf6bbd076d]
[node05:01937] [14] /home/user/hello/./hello[0x400749]
[node05:01937] *** End of error message ***
[node05:01938] *** Process received signal ***
[node05:01938] Signal: Segmentation fault (11)
[node05:01938] Signal code: Address not mapped (1)
[node05:01938] Failing at address: 0x30
[node05:01940] *** Process received signal ***
[node05:01940] Signal: Segmentation fault (11)
[node05:01940] Signal code: Address not mapped (1)
[node05:01940] Failing at address: 0x30
[node05:01938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f68b2e10cb0]
[node05:01938] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f68ae8484c6]
[node05:01938] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f68ae8494cb]
[node05:01940] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f8af1d82cb0]
[node05:01940] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f8aed7ba4c6]
[node05:01940] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f8aed7bb4cb]
[node05:01940] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f8aed7b6141]
[node05:01940] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f8aed7a6ad0]
[node05:01938] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f68ae844141]
[node05:01938] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f68ae834ad0]
[node05:01938] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f68b309bb34]
[node05:01938] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f68aea5c652]
[node05:01940] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f8af200db34]
[node05:01940] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f8aed9ce652]
[node05:01938] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f68b309b359]
[node05:01938] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f68acbad975]
[node05:01940] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f8af200d359]
[node05:01940] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f8aebb1f975]
[node05:01940] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f8af201e0bc]
[node05:01938] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f68b30ac0bc]
[node05:01938] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f68b305d89d]
[node05:01940] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f8af1fcf89d]
[node05:01938] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f68b307d56b]
[node05:01938] [12] /home/user/hello/./hello[0x400826]
[node05:01940] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f8af1fef56b]
[node05:01940] [12] /home/user/hello/./hello[0x400826]
[node05:01938] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f68b2a6276d]
[node05:01938] [14] /home/user/hello/./hello[0x400749]
[node05:01938] *** End of error message ***
[node05:01940] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f8af19d476d]
[node05:01940] [14] /home/user/hello/./hello[0x400749]
[node05:01940] *** End of error message ***
[node05:01939] *** Process received signal ***
[node05:01939] Signal: Segmentation fault (11)
[node05:01939] Signal code: Address not mapped (1)
[node05:01939] Failing at address: 0x30
[...] etc.

Does anyone know what the problem is? I have built Open MPI with Slurm support and installed the same versions of the compilers and libraries everywhere; in fact, all the libraries are on an NFS share which is mounted on each node.
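For reference, the build was configured roughly like this (a sketch: the prefix is taken from the stack traces above, and the --with-pmi path is a guess — it must point to wherever Slurm's pmi.h and libpmi actually live on this system):

```shell
# Hypothetical rebuild of Open MPI 1.8 with Slurm/PMI support.
# Adjust --prefix and the PMI location to your installation.
./configure --prefix=/opt/cluster/spool/openMPI/1.8/gcc \
            --with-slurm \
            --with-pmi=/usr
make -j4 && make install
```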

Remarks:

It should use InfiniBand, as it is installed. But when I launch Open MPI with mpirun, I notice the following:

[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: cluster

which I guess means "not running over InfiniBand". I have installed the InfiniBand drivers and set up IP over InfiniBand. Slurm is configured to run with the InfiniBand IPs: is that the right configuration?
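For what it's worth, the basic IB state can be checked per node like this (a sketch, assuming the standard OFED tools are installed and the IPoIB interface is named ib0):

```shell
# Sanity checks on each compute node:
ibstat                # the port "State" should be "Active"
ip addr show ib0      # shows the IPoIB address, if one is configured
```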

Thanks in advance. Best regards.

EDIT:

I have just tried to compile it with MPICH2 instead of Open MPI, and it works with Slurm. So the problem is probably related to Open MPI and not to the Slurm configuration?

EDIT 2: Actually, I have seen that with Open MPI 1.6.5 (instead of 1.8), and with the sbatch command instead of srun, my script is executed (i.e. it returns the process number, rank and host). But it shows warnings related to the OpenFabrics device and the allocation of registered memory:

The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    node05
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node05
  Local device: mlx4_0
--------------------------------------------------------------------------
Hello world from process 025 out of 048, processor name node06
Hello world from process 030 out of 048, processor name node06
Hello world from process 032 out of 048, processor name node06
Hello world from process 046 out of 048, processor name node07
Hello world from process 031 out of 048, processor name node06
Hello world from process 041 out of 048, processor name node07
Hello world from process 034 out of 048, processor name node06
Hello world from process 044 out of 048, processor name node07
Hello world from process 033 out of 048, processor name node06
Hello world from process 045 out of 048, processor name node07
Hello world from process 026 out of 048, processor name node06
Hello world from process 043 out of 048, processor name node07
Hello world from process 024 out of 048, processor name node06
Hello world from process 038 out of 048, processor name node07
Hello world from process 014 out of 048, processor name node05
Hello world from process 027 out of 048, processor name node06
Hello world from process 036 out of 048, processor name node07
Hello world from process 019 out of 048, processor name node05
Hello world from process 028 out of 048, processor name node06
Hello world from process 040 out of 048, processor name node07
Hello world from process 023 out of 048, processor name node05
Hello world from process 042 out of 048, processor name node07
Hello world from process 018 out of 048, processor name node05
Hello world from process 039 out of 048, processor name node07
Hello world from process 021 out of 048, processor name node05
Hello world from process 047 out of 048, processor name node07
Hello world from process 037 out of 048, processor name node07
Hello world from process 015 out of 048, processor name node05
Hello world from process 035 out of 048, processor name node06
Hello world from process 020 out of 048, processor name node05
Hello world from process 029 out of 048, processor name node06
Hello world from process 016 out of 048, processor name node05
Hello world from process 017 out of 048, processor name node05
Hello world from process 022 out of 048, processor name node05
Hello world from process 012 out of 048, processor name node05
Hello world from process 013 out of 048, processor name node05
Hello world from process 000 out of 048, processor name node04
Hello world from process 001 out of 048, processor name node04
Hello world from process 002 out of 048, processor name node04
Hello world from process 003 out of 048, processor name node04
Hello world from process 006 out of 048, processor name node04
Hello world from process 009 out of 048, processor name node04
Hello world from process 011 out of 048, processor name node04
Hello world from process 004 out of 048, processor name node04
Hello world from process 007 out of 048, processor name node04
Hello world from process 008 out of 048, processor name node04
Hello world from process 010 out of 048, processor name node04
Hello world from process 005 out of 048, processor name node04
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[node04:04390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt / error in device init

What I understand from that is that a) v1.6.5 has better error handling, and b) I have to configure Open MPI and/or the InfiniBand drivers with a larger registered-memory size. I have seen this page, and apparently I only need to modify the Open MPI side? I have to test it...
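If it really is the memlock limit (65536 in the warning above), the fix from the FAQ is pure system configuration, no recompilation — a sketch, with file locations that vary by distribution:

```shell
# 1. Raise the locked-memory limit for all users, e.g. in
#    /etc/security/limits.conf on every compute node:
#      *  soft  memlock  unlimited
#      *  hard  memlock  unlimited
# 2. Make sure slurmd itself starts with that limit (e.g. add to its
#    init script), since child MPI ranks inherit it from the daemon:
ulimit -l unlimited
# 3. Verify from inside a Slurm job that the limit reached the ranks:
srun -n 1 bash -c 'ulimit -l'    # should print "unlimited"
```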

Danduk82

2 Answers


Two things: to "srun ... mpi_app", you need to do special things in OMPI. See http://www.open-mpi.org/faq/?category=slurm for how to run Open MPI jobs under SLURM.
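For example, a direct launch would look something like this (assuming Open MPI was configured with PMI support; the plugin name here is an assumption — check what your Slurm actually offers):

```shell
# List the MPI launch plugins this Slurm installation supports:
srun --mpi=list
# Direct launch via PMI (requires an Open MPI built --with-pmi):
srun --mpi=pmi2 -n 20 ./hello
```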

The usnic message seems like a legitimate bug report that you should submit to the Open MPI user's mailing list:

http://www.open-mpi.org/community/lists/ompi.php

In particular, I would like to see some details in order to figure out why you're getting the warning message about usNIC (I'm guessing you're not running on a Cisco UCS platform with usNIC installed, but if you have IB installed, you shouldn't see this message).
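One quick way to see whether native IB is actually in use (MCA flags valid for the Open MPI 1.6/1.8 series):

```shell
# Force the native IB transport: the job aborts if openib cannot initialize.
mpirun --mca btl openib,self -n 20 ./hello
# Conversely, rule IB out entirely by restricting Open MPI to TCP:
mpirun --mca btl tcp,self -n 20 ./hello
```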

Jeff Squyres
  • Thanks for your answer; actually I have seen the pages you are suggesting, and I am using it that way. But I have seen in the logs that, when running through Slurm, it has problems registering memory through the InfiniBand fabric. I found this page: http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem and I am trying to solve it. What I don't understand for the moment is whether I have to change some InfiniBand parameters (maybe re-compile the kernel module with additional parameters) or whether I have to change some Open MPI config... – Danduk82 Apr 23 '14 at 07:08
  • You probably want to see this FAQ item: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more I'm guessing that your Slurm resource daemon is not starting up with the right locked-pages limits. You should not need to recompile your kernel; just allow Slurm to use all locked memory. When you run under MPICH, you're not using InfiniBand; that's why you don't get the same kinds of warnings / errors. BTW, if you care, we just fixed the "usnic" warning that you saw; that will be included in 1.8.2: https://svn.open-mpi.org/trac/ompi/changeset/31490 – Jeff Squyres Apr 23 '14 at 13:02
  • Hmm, why do you think I'm not using InfiniBand with MPICH? What should I look at? I guess they are, as the InfiniBand IPs are the ones set in slurm.conf... no? – Danduk82 Apr 23 '14 at 17:55
  • MPICH does not support (native) InfiniBand. IB supports an IP emulation layer (i.e., you can have IP addresses on an IB network), but it's significantly lower performance than native IB support. So MPICH can use the IP addresses on your IB network. Open MPI uses native IB support, which is why it needs access to locked memory, etc. – Jeff Squyres Apr 23 '14 at 21:17
  • Haha! So you're telling me I don't need IPoIB in my Slurm configuration file, and that Open MPI should "understand" by itself how to use IB? I guess I'm missing one "brick" in my understanding of the configuration... – Danduk82 Apr 24 '14 at 07:42
  • That is correct. For further explanations of IB, you should probably read your IB vendor's documentation. FWIW, Cisco stopped selling IB years ago -- I haven't been actively involved in IB stuff in quite a while. – Jeff Squyres Apr 24 '14 at 12:33
  • I have posted a couple of questions on the Mellanox (my IB vendor) forum but I haven't got an answer yet... Anyway, I think the question can be considered *answered*: it's a problem of InfiniBand configuration. Thanks for your answers. – Danduk82 Apr 24 '14 at 12:41
  1. My solution: upgrade to Slurm 14.03.2-1 and Open MPI 1.8.1.

  2. Bizarrely, I ran into exactly this problem on some of my nodes (segfault in the openib BTL) after an InfiniBand network reorganisation. I was using Slurm 2.6.9 and Open MPI 1.8.

On the racks with Dell/AMD Opteron/Mellanox it would segfault (and it was working before the network reorganisation).

Racks with HP/Intel/Mellanox continued to work pre- and post-reorg.

This may have something to do with the InfiniBand topology.

AAlba
  • I have upgraded to **Slurm 2.6.10** and it works. I'm using both Open MPI 1.6.5 and 1.8. I had to compile the Mellanox driver with MXM support and Slurm with PMI, then configure Open MPI to use both (MXM and PMI) with custom paths. Now I have to say that it is not bad. I do have a question about the bandwidth: I tried a few benchmark codes I found on the net, and apparently my bandwidth is at 4 Gb/s when it should be 10 Gb/s, and I still don't know why. – Danduk82 May 06 '14 at 17:03
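Regarding the bandwidth observation in the last comment: the raw link can be measured independently of MPI, which separates a fabric problem from an Open MPI tuning problem (a sketch, assuming OFED's ibstat and the perftest package are installed; hostnames are this cluster's):

```shell
# Nominal link rate, per node:
ibstat | grep Rate          # e.g. "Rate: 40" for 4x QDR
# Raw RDMA bandwidth between two nodes:
ib_write_bw                 # start on the server node first
ib_write_bw node05          # then run on the client, pointing at the server
```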