
On newly installed and configured compute nodes in our small cluster I am unable to submit Slurm jobs using a batch script and the 'sbatch' command. After submission, the requested node goes into the 'drained' state. However, I can run the same command interactively using 'srun'.

Works:
srun -p debug --ntasks=1 --nodes=1 --job-name=test --nodelist=node6 -l echo 'test'

Does not work:
sbatch test.slurm
with test.slurm:

#!/bin/sh
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --nodelist=node6
#SBATCH --partition=debug

echo 'test'

It gives me:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug         up    1:00:00      1  drain node6

and I have to resume the node.
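For reference, I resume it with the standard scontrol command (nothing site-specific here):

scontrol update NodeName=node6 State=RESUME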

All nodes run Debian 9.8 and use InfiniBand and NIS. I have made sure that all nodes have the same configuration, the same package versions, and the same daemons running, so I don't see what I am missing.

  • You can see the reason the node went to drain with `scontrol show node node6 | grep Reason`, or check the Slurm controller's log files. – Keldorn Mar 26 '19 at 03:02
  • Thanks, Keldorn. That provided some useful information. We finally fixed the issue. – Iomsn Mar 27 '19 at 09:48

1 Answer


It turned out the issue was related to NIS. I just needed to add this line to the end of /etc/passwd:

+::::::

and restart slurmd on the node:

/etc/init.d/slurmd restart
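The +:::::: entry is the standard glibc compat-mode marker that merges in all NIS passwd entries. My understanding is that without it slurmd could not resolve the submitting user's account on the node, so launching the batch job failed and Slurm drained the node. To check that lookups work on the node (someuser below is just a placeholder for any NIS account):

getent passwd someuser
id someuser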