
I am quite new to Slurm and this community, so please correct me if I am doing anything wrong! :)

I need to run my executable (a Python script) many times in parallel on an HPC cluster. The executable takes the Slurm array task ID as input; inside the Python script this ID is mapped onto several parameters, which in turn determine the data that gets imported. Note that the executable itself is not internally parallelised, so I think each invocation should be able to run on a single CPU.

My aim: run as many invocations of my executable concurrently as possible; I was thinking at least 50 at once.

In principle, my scripts are working as intended on the cluster. I use this Slurm submission script:

#!/bin/bash -l

#SBATCH --job-name=NAME
#SBATCH --chdir=/my/dir
#SBATCH --output=.job/NAME%A_%a.out
#SBATCH --error=.job/NAME%A_%a.err
#SBATCH --mail-type=END
#SBATCH --mail-user=USER

# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00

# --- start from a clean state and load necessary environment modules ---
module purge
module load anaconda/3

# --- instruct OpenMP to use the number of cpus requested per task ---
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# --- run executable via srun ---
srun ./path/to/executable.py $SLURM_ARRAY_TASK_ID

However, with this script only 8 jobs (that is, 'executable.py 1', 'executable.py 2', ...) run in parallel, each on a different node. (Note: I don't quite know what 'export OMP_NUM_THREADS' does; I was told to include it by IT support.) As soon as 'executable.py 1' ends, 'executable.py 9' starts. But I want far more than 8 concurrently running invocations. So I thought I should specify that each invocation only needs one CPU; maybe then many more of my jobs could run in parallel on the 8 nodes I somehow seem to receive. My new submission script looks like this (for readability I only show the 'resource specification' part; the rest is unchanged):

# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=10
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00

This way, though, it seems that my executable gets run ten times for each Slurm array task ID, that is, 'executable.py 1' is run ten times, as is 'executable.py 2' and so on. This is not what I intended.
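
For clarity, here is a sketch (untested) of the resource specification that I think expresses what I actually want, namely exactly one task with one CPU per array element; I kept the partition, memory, and time limits from my original script:

# --- resource specification (sketch, untested) ---
#SBATCH --partition=general
#SBATCH --array=1-130
# one task per array element, each needing only a single CPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00

My understanding (which may be wrong) is that each array element then becomes a small one-CPU job that the scheduler can pack onto shared nodes.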

I think at the bottom of my problem is that (i) I am seriously confused by the SBATCH options --ntasks-per-node, --ntasks, --cpus-per-task, --nodes, etc., and (ii) I don't really know conceptually what a 'job', a 'job step' or a 'task' is meant to be (both for my case and on the sbatch man page).

If anyone knows which combination of SBATCH options gives me what I want, I would be very grateful for a hint. Also, if you can explain in plain English how jobs, job steps, tasks, etc. are defined, that would be great.

Please note that I have stared extensively at the man pages and some online documentation. I also asked my local IT support, but sadly they were not awfully helpful. I really need my script to run in parallel on a large scale, and I also want to understand the workings of Slurm a bit better. I should add that I am not a computer scientist by training; this is not my usual playing field.

Thanks so much for your time everyone!

cheshire
  • The `OMP_NUM_THREADS` variable tells the system how many threads an OpenMP program should use. Your code is not parallelized, so that line is useless (it doesn't hurt, but it doesn't help either). Regarding your main issue: with the first resource specification, Slurm should start as many jobs as possible, each one running a single independent calculation. You complain that only 8 are running concurrently. Did you check that there were free resources for more jobs with your requirements, and that no higher-priority jobs were waiting for those resources? – Poshi May 17 '20 at 10:31
  • Thank you, Poshi. What exactly do you mean by 'single independent calculation'? I want to invoke each job on a separate CPU, not a separate node (I also don't want to block resources for other users), as I think that should be sufficient. Can I somehow request X nodes so that I get a total of 130 CPUs, on which all my jobs run completely in parallel (given the memory requirements of the nodes)? – cheshire May 17 '20 at 13:44
  • As to whether I checked the available resources: the 'general' partition of the cluster I am working on has 773 compute nodes with 32 cores per node (each with 2 hyperthreads, thus 64 logical CPUs per node) and memory of 768 x 128 GB, 1 x 256 GB, 4 x 512 GB. There is also another cluster I could use, which has 3240 compute nodes with 40 cores per node (each with 2 hyperthreads, thus 80 logical CPUs per node) and memory of 1284 x 96 GB, 1932 x 192 GB, 16 x 384 GB, 8 x 768 GB. – cheshire May 17 '20 at 13:46
  • I monitored my job array for several hours, and there were never more than 8 jobs running in parallel. The pending queue is quite long; I did not meticulously check, though, what the other queued, higher-priority jobs looked like. Or is there an easy way to do that? (Some commands for this are sketched below the comments.) Are you absolutely positive that my first resource specification should achieve what I am trying to do? I just want to be sure that I am not mis-specifying things or leaving anything to chance. Thank you so much for your input! – cheshire May 17 '20 at 13:47
  • Your ton of questions cannot be answered in a few comments. You have to look at the documentation for most of your doubts, as this site is for helping you with specific programming issues. "Single independent calculation" -> "calculation" is what you are computing, "independent" because your different executions don't depend on each other, "single" because each job in the job array only starts one of those calculations. – Poshi May 17 '20 at 13:51
  • "I also don't want to block resources for other users" -> wrong concept. Every time you submit a job, you are asking TO BLOCK resources for YOU; otherwise you will never have resources for your own calculations. – Poshi May 17 '20 at 13:52
  • You can ask for enough nodes at once, but this puts a set of issues on the table: 1) you will have to wait for all the resources to be free simultaneously, which you don't really need, as your calculations are independent. This will make your single job wait longer to start, and the resources being blocked for you will be unavailable to anyone else while your job is waiting. 2) Probably not all the invocations will take the same time to finish, which implies that the resources that finish early will be wasted, as they stay blocked until your entire job finishes. – Poshi May 17 '20 at 13:55
  • In case you want to try: you can fit up to 8 jobs on a node (8*16 = 128 GiB) and you have 130 jobs, so you need to ask for 17 full nodes with 8 tasks per node: --nodes=17, --ntasks-per-node=8, --mem=128G; and with srun you have to tell every invocation to run in the background as a single task: srun --ntasks=1 ... & (a fuller sketch of this layout is written out below the comments). – Poshi May 17 '20 at 14:00
  • If the execution queue is long, chances are that you were just waiting because there were no free resources at that time for scheduling your next jobs. – Poshi May 17 '20 at 14:01
  • Thank you for your patience. By 'blocking' I rather meant blocking a whole node by accidentally requesting too much memory or too many CPUs while falsely thinking I am invoking many jobs on the node, although I am only invoking one. – cheshire May 17 '20 at 14:16
  • I see, so after adjusting my Slurm array job script, I would finish it with `srun --ntasks=1 ./path/to/executable.py $SLURM_ARRAY_TASK_ID`? But I understand that this would result in my whole array just waiting longer. If you have good documentation or a guide for Slurm, please do let me know. – cheshire May 17 '20 at 14:18
  • You forgot to put the task in the background. Look carefully: there was an ampersand in my comment with the execution line. It should be `srun --ntasks=1 ./path/to/executable.py "$SLURM_ARRAY_TASK_ID" &` – Poshi May 17 '20 at 14:31
  • Thanks so much Poshi. I'll try out your suggestions. :) – cheshire May 17 '20 at 15:45
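
To spell out what I understand Poshi's full-node suggestion to look like, here is a minimal, untested sketch of a single (non-array) job script that asks for 17 nodes with 8 tasks each and launches the 130 invocations as background job steps. The job name, module, paths and limits are copied from my original script; the --exclusive and step-level --mem options are additions meant to give each step its own CPU and its 16G share of node memory (on newer Slurm versions, srun may want --exact instead of --exclusive for this):

#!/bin/bash -l

#SBATCH --job-name=NAME
#SBATCH --chdir=/my/dir
#SBATCH --partition=general
#SBATCH --nodes=17
#SBATCH --ntasks-per-node=8
#SBATCH --mem=128G
#SBATCH --time=13:00:00

module purge
module load anaconda/3

# launch one single-task job step per input index, all in the background
for i in $(seq 1 130); do
    srun --ntasks=1 --exclusive --mem=16G ./path/to/executable.py "$i" &
done

# wait for all background job steps to finish before the batch job ends
wait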
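
And for checking the queue and free resources (as discussed in the comments), these are the standard Slurm commands I plan to use (a sketch only; I am not yet sure how informative their output is on our cluster):

# my own pending and running jobs
squeue -u "$USER"

# all pending jobs in the partition, to see what is queued ahead of mine
squeue -p general --state=PENDING

# scheduling priorities of pending jobs
sprio -l

# per-node state, allocated/idle CPUs and memory in the partition
sinfo -p general -o "%n %t %C %m"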

0 Answers