
I am attempting to create my own computer cluster (perhaps a Beowulf, though throwing that term around willy-nilly apparently isn't cool) and have installed Slurm as my scheduler. Everything appears fine when I run sinfo:

danny@danny5:~/Cluster/test$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      5   idle danny[1-5]
danny@danny5:~/Cluster/test$ 

However, if I try to submit a job using the following script:

danny@danny5:~/Cluster/test$ cat script.sh
#!/bin/bash -l
#SBATCH --job-name=JOBNUMBA0NE
#SBATCH --time=00-00:01:00
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100
#SBATCH -o stdout
#SBATCH -e stderr
#SBATCH --mail-type=END
#SBATCH --mail-user=dkweiss@wesleyan.edu

gfortran -O3 -i8 0-hc1.f

./a.out

I receive a lovely Submitted batch job 6; however, nothing appears in squeue, and none of the expected output files materialize (the a.out executable doesn't even appear). I will attach the associated info for scontrol show partition:

danny@danny5:~/Cluster/test$ scontrol show partition
PartitionName=debug
   AllocNodes=ALL AllowGroups=ALL Default=YES
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 MaxCPUsPerNode=UNLIMITED
   Nodes=danny[1-5]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=8 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Any ideas?

Danny Weiss

3 Answers

I had the same problem. I suppose there could be more reasons why jobs just disappear without any feedback, but in my case Slurm simply lacked the necessary privileges. Therefore:

  1. Try running sbatch with sudo; if it succeeds, this is probably the same issue.
  2. If you are not able to try that, at least set the output and error file paths manually and make sure that Slurm is able to write there.
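A minimal sketch of step 2, using an illustrative log-directory path (the exact path is an assumption, not from the question):

```shell
# Create a log directory with an absolute path and verify the
# submitting user can write to it before pointing the
# #SBATCH -o/-e directives there. The path is only an example.
LOGDIR="$HOME/Cluster/test/logs"
mkdir -p "$LOGDIR"

if [ -w "$LOGDIR" ]; then
    echo "log directory is writable"
else
    echo "log directory is NOT writable" >&2
fi

# Then, in the batch script, use absolute paths instead of bare names:
#   #SBATCH -o /home/danny/Cluster/test/logs/stdout.%j
#   #SBATCH -e /home/danny/Cluster/test/logs/stderr.%j
# (%j expands to the job ID, so successive runs don't overwrite each other)
```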
user2604899
  • In my case, yes, indeed, the error was in erroneous output and error paths. – Eugene W. Apr 09 '21 at 12:14
  • Had this happen because the output was a relative path to a directory, so the job only worked when run from a specific location. Better to use absolute paths. – Dan Mandel Jun 07 '22 at 03:57
  • But with sudo you become root, and that's the default user in Slurm. What if you don't have sudo? That's the real problem in this question. – stats con chris Apr 15 '23 at 23:25
This happened to me when the log folder did not exist (it had not been created beforehand). Slurm does not automatically create output directories for you.
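A minimal guard for this, with an illustrative logs/ directory name — create the directory before calling sbatch:

```shell
# If the directory named in the #SBATCH -o/-e directives does not
# exist, the job can vanish with no feedback, so create it up front
# (the directory name here is only an example).
mkdir -p logs

# Only submit once the directory is in place; the sbatch call itself
# of course requires a working Slurm installation.
if [ -d logs ]; then
    echo "logs/ exists; safe to run: sbatch script.sh"
fi
```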

I have seen that behaviour when the user submitting the job (here, danny) does not exist with the same UID on the compute nodes. Make sure that id danny reports the same output on all Slurm-related nodes. You can look for confirmation in the compute node's Slurm log file.
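A sketch of that check; check_uid is a hypothetical helper, and in practice the remote UID would come from ssh (the danny[1-5] hostnames are taken from the question):

```shell
# Compare the head node's UID for a user against the UID reported by
# a compute node; a mismatch can make jobs disappear without output.
check_uid() {
    # $1 = hostname (used only as a label here)
    # $2 = UID reported by that host (normally: ssh "$1" id -u danny)
    local_uid="$(id -u)"
    if [ "$2" = "$local_uid" ]; then
        echo "$1: OK (uid $2)"
    else
        echo "$1: MISMATCH (local $local_uid, remote $2)"
    fi
}

# In practice, gather the remote UID over ssh for every node:
#   for host in danny1 danny2 danny3 danny4 danny5; do
#       check_uid "$host" "$(ssh "$host" id -u danny)"
#   done
```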

damienfrancois