I have a program that takes an input file describing a range of initial conditions and outputs a range of final conditions. I also have a batch script that "parallelizes" the program by breaking the initial condition range in the input file into smaller chunks and feeding each chunk to an independent instance of the program. This works fine as long as I run the batch script on a single node, but if I request more than one node, every instance of the program is duplicated on each node.
Here's a vastly simplified version of the batch script I'm using that reproduces the problem:
---my_job_splitter.sh---
#!/usr/bin/env bash
#SBATCH --job-name=Test
#SBATCH --output=%x_%j_%t.csv
#SBATCH --error=log.err
#SBATCH --mail-type=END,FAIL
#SBATCH --nodes=3

# Read the command line input
if [ "$#" -ge 4 ]; then
    numtasks=${1}
    inptrangestart=${2}
    inptrangeend=${3}
    inptrangenum=${4}
fi

# Calculate the size of the chunks to break the range into
# (e.g. 4000 points / 32 tasks = 125 points per chunk)
chunksize=$((inptrangenum / numtasks))

# Run a separate instance of my_program to process each smaller chunk of the input range
for ((ii = 0; ii < numtasks; ii++)); do
    # stp is the spacing between consecutive points; a and b are the
    # first and last points of chunk ii
    stp=$(echo "scale=4; ($inptrangeend - $inptrangestart)/($inptrangenum - 1)" | bc)
    a=$(echo "$chunksize * $stp * $ii" | bc)
    b=$(echo "$a + ($chunksize - 1) * $stp" | bc)
    srun my_program.sh "$a" "$b" "$chunksize" &
done
wait
For illustration purposes, my_program.sh is just a bash script that takes the input range and writes it to stdout as a CSV line:
---my_program.sh---
#!/usr/bin/env bash
echo "$1,$2,$3"
If everything were working the way I want, then running

sbatch my_job_splitter.sh 32 0 1000 4000

should produce a CSV file with 32 entries, each covering 1/32 of the range 0:1000. Instead I get a CSV file with 96 entries, and each range chunk is duplicated 3 times. I think I understand what's going on: each time I run srun, it sees that I've been allocated 3 nodes and assumes I want 1 task per node, so it duplicates the task until 1 copy has been assigned to each node. But I don't know how to fix it, or whether this is a sensible approach in the first place.
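As far as I can tell from the srun documentation, a bare srun inside the allocation inherits the job's node count and defaults to one task per node, so each loop iteration behaves as if I had written the following (a sketch of my understanding, not a line from the actual script):

# What I believe each bare srun expands to inside a 3-node allocation:
# it inherits --nodes=3 from sbatch and launches one task per node,
# i.e. 3 identical copies of my_program.sh per job step.
srun --nodes=3 --ntasks=3 my_program.sh "$a" "$b" "$chunksize" &

That would account for every chunk appearing exactly 3 times in the output.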
Other things I have tried:
- Using the --exclusive flag on srun: this just causes srun to use only one node and ignore the other allocated nodes (exact invocation sketched below).
- Not using srun at all: this seems to have the same effect as using srun --exclusive.
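For reference, the --exclusive attempt only changed the srun line inside the loop; it looked essentially like this:

srun --exclusive my_program.sh "$a" "$b" "$chunksize" &

but, as noted above, that kept everything on a single node rather than spreading the tasks across the allocation.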