
We are running a small cluster environment with Intel Xeon nodes connected via InfiniBand. The login node is not attached to the InfiniBand interconnect. All nodes run Debian Jessie.

We run Slurm 14.03.9 on the login node. As the system OpenMPI is outdated and does not support the MPI-3 interface (which I require), I compiled a custom OpenMPI 2.0.1.

When I start MPI jobs by hand via

mpirun --hostfile hosts -np xx program_name,

it runs fine, also across multiple nodes, and takes full advantage of InfiniBand. Good.
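
For reference, the hostfile passed via --hostfile is just a plain list of compute node names, optionally with a slot count per node (the names and counts below are placeholders):

node01 slots=16
node02 slots=16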

However, when I call my MPI application from inside a Slurm runscript, it crashes with strange segfaults. I compiled OpenMPI with Slurm support, and the PMI integration also seems to work, so I can simply write

mpirun program_name

in the Slurm runscript, and it automatically dispatches the jobs to the correct nodes with the correct number of CPU cores. However, I keep getting these segfaults.
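
For illustration, the runscript is essentially an ordinary sbatch script along these lines (job name, partition, and resource values are placeholders for what the actual job requests):

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# PMI support lets mpirun pick up the allocation without -np/--hostfile
mpirun program_name

It is submitted with sbatch in the usual way.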

Explicitly specifying "-np" and "--hostfile" to mpirun in the Slurm runscript does not help either. The exact same command that runs fine when started by hand leads to a segfault when started inside the Slurm environment.

Before the segfaults occur, I get the following error message from OpenMPI:

--------------------------------------------------------------------------
Failed to create a completion queue (CQ):

Hostname: xxxx
Requested CQE: 16384
Error:    Cannot allocate memory

Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: xxxx
--------------------------------------------------------------------------

I googled for it, but did not find much useful information. I assumed that it might be a limit on locked memory, but executing "ulimit -l" on the compute nodes returns "unlimited", as it should.

I appreciate any help in getting my jobs to run with OpenMPI inside the Slurm environment.

  • have you run ulimit inside a Slurm job? It may be that Slurm has been started without the proper locked memory limit. – Carles Fenoy Sep 15 '16 at 14:13
  • @Carles Fenoy: Very good point! When I run "ulimit -l" inside the Slurm runscript, it displays 64, which is very little. However, if I log in to the same compute node via ssh and run "ulimit -l", it shows "unlimited". If I try to change the limit inside the runscript, I get "max locked memory: cannot modify limit: Operation not permitted". Is there a way to allow users to change this limit? I have root access on all nodes. – Brehministrator Sep 15 '16 at 15:44
  • It needs to be changed in the Slurm init script, by adding ulimit -l unlimited at the beginning – Carles Fenoy Sep 15 '16 at 16:50
  • Thanks. I added "ulimit -l unlimited" to the /etc/init.d/slurmd startup script on the node where my job runs, but I still get "64" back inside my job runscript. I restarted the Slurm daemon on that node, and even rebooted the machine, but the result is the same. I find this quite puzzling. I also added a corresponding entry to /etc/security/limits.conf (but as far as I know, daemons do not read this file at startup). – Brehministrator Sep 15 '16 at 17:36
  • check that the slurmd process limits are properly set. Once you have restarted the service, check the limits with `cat /proc/PID/limits` – Carles Fenoy Sep 16 '16 at 10:35
  • @Carles Fenoy: The process limits of the Slurm daemon are indeed too low, and I really don't understand why. I have `ulimit -l unlimited` in the `/etc/init.d/slurmd` startup script, the same also in `/etc/default/slurmd` (albeit redundant), and also `root hard memlock unlimited` in my `/etc/security/limits.conf`. When I log in as any user or as root, I see the correct (unlimited) limit. The daemon still has the wrong, low limit. I restarted the daemon and rebooted the machine several times. I also tried a large number instead of `unlimited` - same problem. Any further ideas what goes wrong here? – Brehministrator Sep 16 '16 at 15:50
  • are you using systemd? – Carles Fenoy Sep 16 '16 at 15:51
  • @Carles Fenoy: It is an up-to-date standard Debian Jessie installation, which should use systemd by default. I just checked and can confirm that: the process with PID 1 is `/sbin/init`, but this is just a link to systemd. When using systemd, is there another way to change the limits for a daemon? I am not familiar with systemd, sorry. – Brehministrator Sep 16 '16 at 16:41
  • I don't know exactly how it is implemented in Debian, but for CentOS you can add `LimitMEMLOCK=unlimited` to the service file in "/usr/lib/systemd/system/slurm.service". After the change, remember to reload the systemd config with `systemctl daemon-reload` and then restart the service – Carles Fenoy Sep 16 '16 at 16:53
  • @Carles Fenoy: That worked out exactly like you described it. Thanks! Now I will check if it solves my original problem of the segfaults. – Brehministrator Sep 16 '16 at 17:13
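
For reference, the check suggested in the comments above looks roughly like this on a compute node (the PID lookup via pidof and the 64 KiB output line are illustrative):

cat /proc/$(pidof slurmd)/limits | grep "locked memory"
Max locked memory    65536    65536    bytes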

1 Answer


Finally, I was able to resolve the problem.

The segfaults were indeed related to the error message posted above, which was a consequence of a far too low "max locked memory" limit on the compute node where Slurm dispatched the job.

I struggled for a long time to lift this locked memory limit. None of the standard procedures one finds via Google worked (neither editing /etc/security/limits.conf nor editing /etc/init.d/slurmd). The reason was that my Debian Jessie nodes use systemd, which does not honor these files. I had to add the lines

[Service]
LimitMEMLOCK=32768000000

into the file /etc/systemd/system/multi-user.target.wants/slurmd.service on all my nodes. It did not work with unlimited, so I had to use the total system RAM in bytes instead. After modifying this file, I executed

systemctl daemon-reload
systemctl restart slurmd

on all nodes, and finally the problems vanished. Thank you, Carles Fenoy, for your valuable comments!
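
To verify that the raised limit actually reaches the jobs, one can repeat the check from the comments inside the runscript itself (a minimal sketch; the value printed depends on what LimitMEMLOCK was set to):

#!/bin/bash
# print the locked memory limit as seen by the Slurm job
ulimit -l    # was 64 before the fix, should now show the raised value
mpirun program_name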

  • I faced the same problem running my Node.js application using `mpirun` with [pm2](http://pm2.keymetrics.io/). Setting `LimitMEMLOCK` for `pm2-` service also works for me. Thanks a lot for sharing this solution. – ezze May 03 '18 at 15:21
  • IIUC, you say setting `LimitMEMLOCK=unlimited` in `slurmd.service` didn't help, but shouldn't this be `LimitMEMLOCK=infinity`? – loris Sep 04 '19 at 07:10