
I have a program that takes an input file describing a range of initial conditions and outputs a range of final conditions. I also have a batch script that "parallelizes" the program by just breaking up the initial condition ranges in the input file into smaller chunks and feeding them into independent instances of the program. This seems to work fine as long as I only try to run the batch script on one node, but if I request more than one node, each instance of the program is duplicated on each node.

Here's a vastly simplified version of the batch script I'm using that duplicates the problem:

---my_job_splitter.sh---
#!/usr/bin/env bash

#SBATCH --job-name=Test
#SBATCH --output=%x_%j_%t.csv
#SBATCH --error=log.err
#SBATCH --mail-type=END,FAIL
#SBATCH --nodes=4

# Read the command-line input
if [ "$#" -ge 4 ]; then
    numtasks=${1}
    inptrangestart=${2}
    inptrangeend=${3}
    inptrangenum=${4}
else
    echo "Usage: sbatch my_job_splitter.sh <numtasks> <rangestart> <rangeend> <rangenum>" >&2
    exit 1
fi

# Calculate the size of the chunks to break the range into
chunksize=$((inptrangenum / numtasks))

# Run a separate instance of my_program.sh to process each smaller chunk of the input range
for ((ii = 0; ii < numtasks; ii++)); do
    stp=$(echo "scale=4; ($inptrangeend - $inptrangestart) / ($inptrangenum - 1)" | bc)
    a=$(echo "$chunksize * $stp * $ii" | bc)
    b=$(echo "$a + ($chunksize - 1) * $stp" | bc)
    srun my_program.sh "$a" "$b" "$chunksize" &
done

wait

For illustration purposes, my_program.sh is just a bash script that takes the input range and writes it to stdout as a CSV line:

---my_program.sh---
#!/usr/bin/env bash
echo "$1,$2,$3"

If everything were doing what I want, then running the command sbatch my_job_splitter.sh 32 0 1000 4000 should produce a CSV file with 32 entries, each covering 1/32 of the range 0:1000. Instead I get a CSV file with 96 entries, and each range chunk is duplicated 3 times. I think I understand what's going on: each time I run srun, it sees that I've been allocated 3 nodes, assumes I want 1 task per node, and duplicates the task until it has assigned 1 task to each node. But I don't know how to fix it, or whether this is a stupid way to do it in the first place.
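If I understand that default correctly, it can be seen in isolation with a trivial test job like the one below (a hypothetical script, not part of my actual setup), where a single srun with no task count prints one hostname per allocated node:

---node_test.sh--- (hypothetical)
#!/usr/bin/env bash
#SBATCH --nodes=4

# With no --ntasks given, srun defaults to one task per node,
# so this should print four hostnames, one for each allocated node.
srun hostname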

Other things I have tried:

  • Using the --exclusive flag on srun: this just causes srun to use only one node and ignore the other allocated nodes
  • Not using srun at all: this seems to have the same effect as using srun --exclusive
  • Try telling `srun` to use only one node (`--nodes 1`), although I'm more comfortable telling SLURM how many tasks I need and letting it allocate the number of nodes it considers appropriate. – Poshi Oct 24 '19 at 18:59
  • That also causes it to only use one node out of the 4 allocated. As does using `--ntasks 1`, even when I also specify `--cpus-per-task 1`. If I only specify `--cpus-per-task 1`, then it duplicates the task across all CPUs and across all nodes. – Beezum Oct 25 '19 at 15:26
  • I suppose the obvious fix is to just write all of the split inputs to numbered temporary text files and use a job array, but that's just so *ugly*, especially when I'm splitting it up into 80 or 90 individual pieces. – Beezum Oct 25 '19 at 15:44
  • Well, using one node out of 4 is the expected result, right? You want your `my_program.sh` to be run on a single node, right? On the other hand, you can use a job array without having to create the intermediate files you are talking about: just compute their information on the fly (see the sketch after these comments). – Poshi Oct 25 '19 at 18:09
  • Yes, I want any given instance of `my_program.sh` to run only once on a single node, but if I have allocated, say, nodes 1, 2, 3, and 4 to the overall job and the batch script is running on node 1, then *all* of the instances of `my_program.sh` get run on node 1 and either nodes 2, 3, and 4 don't do anything; or nodes 2, 3, and 4 each run duplicate instances with duplicate inputs of the ones running on node 1. There doesn't seem to be any way to tell the scheduler to put 1/4 of the instances of `my_program.sh` on node 1, 1/4 of them on node 2, etc. – Beezum Oct 25 '19 at 19:13
  • If you tell `srun` to run the task in one node, the others will be idle, but as soon as the second `srun` arrives again and looks for a free node, the second task will be run in one of the three free nodes. Same for the subsequent tasks. – Poshi Oct 26 '19 at 07:50
  • So I'm still not sure what's going on, because if I have a line in `my_program.sh` that prints the value of `$SLURM_NODEID`, all of the instances still print the same node, but if I open the status viewer for the cluster, it shows all of the allocated processors, even the ones on different nodes, as being heavily used. So it looks like it's at least doing what I want, now. I'll post the options I used as an answer. Thanks for your help @Poshi – Beezum Oct 31 '19 at 14:23
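
For completeness, the job-array approach Poshi describes in the comments can compute each chunk on the fly from the array index, with no intermediate files. A rough sketch, with a hypothetical script name and the 32 0 1000 4000 example values hard-coded for illustration:

---my_job_splitter_array.sh--- (hypothetical)
#!/usr/bin/env bash
#SBATCH --job-name=Test_array
#SBATCH --output=%x_%A_%a.csv
#SBATCH --array=0-31

# Hard-coded equivalents of the original command-line arguments (illustrative only)
numtasks=32
inptrangestart=0
inptrangeend=1000
inptrangenum=4000

chunksize=$((inptrangenum / numtasks))
ii=$SLURM_ARRAY_TASK_ID

# Same chunk arithmetic as the original loop, done once per array task
stp=$(echo "scale=4; ($inptrangeend - $inptrangestart) / ($inptrangenum - 1)" | bc)
a=$(echo "$chunksize * $stp * $ii" | bc)
b=$(echo "$a + ($chunksize - 1) * $stp" | bc)

# Assumes my_program.sh sits in the submission directory
./my_program.sh "$a" "$b" "$chunksize"

One trade-off: each array task writes its own output file, so the per-chunk CSV lines would need to be concatenated afterwards.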

1 Answer


So what I ended up doing was using srun within the batch script with the following options:

srun --exclusive --cpus-per-task=1 --ntasks=1 my_program.sh "$a" "$b" "$chunksize" &

This seems to spread all of the individual tasks across all allocated nodes without any duplication.
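
For reference, this is roughly what the loop from my batch script looks like with those options applied (same variable names as in the question; shown as a sketch rather than a verbatim copy of my script):

for ((ii = 0; ii < numtasks; ii++)); do
    stp=$(echo "scale=4; ($inptrangeend - $inptrangestart) / ($inptrangenum - 1)" | bc)
    a=$(echo "$chunksize * $stp * $ii" | bc)
    b=$(echo "$a + ($chunksize - 1) * $stp" | bc)
    # --ntasks=1 keeps each job step to a single task, and --exclusive stops the
    # steps from sharing CPUs, so in practice they spread across the allocated nodes
    srun --exclusive --cpus-per-task=1 --ntasks=1 my_program.sh "$a" "$b" "$chunksize" &
done

wait   # don't let the batch script exit until every backgrounded step has finished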
