
I'm trying to submit multiple jobs in parallel as a preprocessing step in an sbatch script using srun. The loop reads a file containing 40 file names and runs `srun command` on each one. However, not all of the files are sent off with srun, and the rest of the sbatch script continues once the ones that did get submitted finish. The real sbatch script is more complicated and I can't use job arrays for this, so that won't work. This part should be pretty straightforward, though.

I made this simple test case as a sanity check and it does the same thing. For every file name in the file list (40) it creates a new file containing 'foo' in it. Every time I submit the script with sbatch it results in a different number of files being sent off with srun.

#!/bin/sh
#SBATCH --job-name=loop
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
#SBATCH -A zheng_lab
#SBATCH -p exacloud
#SBATCH --error=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.err
#SBATCH --output=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.out

DIR=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel
SAMPLES=$DIR/samples.txt
OUT_DIR=$DIR/test_out
FOO_FILE=$DIR/foo.txt

# Create output directory
srun -N 1 -n 1 -c 1 mkdir $OUT_DIR

# How many files to run
num_files=$(srun -N 1 -n 1 -c 1 wc -l $SAMPLES)
echo "Number of input files: " $num_files

# Create a new file for every file in listing (run 5 at a time, 1 for each node)
while read F; do
    fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
    echo $fn
    srun -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done <$SAMPLES
wait

# How many files actually got created
finished=$(srun -N 1 -n 1 -c 1 ls -lh $OUT_DIR/*out | wc -l)
echo "Number of files submitted: " $finished

Here is my output log file the last time I tried to run it:

Number of input files:  40 /home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/samples.txt
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
Number of files submitted:  8
jaegger
  • Can you confirm that the SAMPLES file is the same at both locations (i see you are using lustre but this would be a first check) – tomgalpin Jan 31 '20 at 10:17
  • The SAMPLES file is in the location specified as $DIR. It doesn't seem to have an issue finding it as there's no error given back saying it can't. The SAMPLES file just contains: /path/to/file/sample1 /path/to/file/sample2 and so forth. The strange thing is that the number that goes through is different every time (8,6,12, etc.). In the real example I'm sorting bam files with sambama, but I do get more (not all) running: 32,38,29, etc. It's always the first N (i.e. 32) that gets sent out. – jaegger Jan 31 '20 at 19:33

1 Answer

The issue is that srun redirects its stdin to the tasks it starts, so the contents of $SAMPLES are consumed, in an unpredictable way, by the cat commands that are started.

Try with

srun --input none -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &

The --input none parameter tells srun not to connect the job's stdin to the tasks it launches, so the while loop keeps sole ownership of the $SAMPLES file on its stdin.
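You can reproduce the effect without Slurm at all. Here is a minimal sketch where plain `cat` stands in for srun (which by default forwards the job's stdin to its tasks); the `samples.txt` file and counter names are just for the demo:

```shell
#!/bin/sh
# Minimal reproduction of the stdin problem: 'cat' stands in for srun,
# which by default connects the batch script's stdin to the task it runs.
printf 'a\nb\nc\nd\n' > samples.txt

n_open=0
while read F; do
    n_open=$((n_open + 1))
    cat > /dev/null              # child inherits stdin and eats the remaining lines
done < samples.txt
echo "stdin shared with child: $n_open iteration(s)"

n_closed=0
while read F; do
    n_closed=$((n_closed + 1))
    cat < /dev/null > /dev/null  # stdin detached, analogous to srun --input none
done < samples.txt
echo "stdin detached: $n_closed iteration(s)"
rm -f samples.txt
```

The first loop runs only once, because the child consumes the rest of the file from the shared stdin; the second loop processes all four lines, which is exactly what `--input none` restores in the real script.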

damienfrancois
  • Interesting. I got it working by first loading each line into an array and then parsing the array with a for loop and running srun within that. I'll have to give this a shot. – jaegger Feb 10 '20 at 18:21
  • I think this should be the accepted answer. What @Julian did (see the comment above) avoids the problem by saving the content of `$SAMPLES` outside of `stdin` – Felix Aug 05 '20 at 19:03
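The array workaround mentioned in the comments can be sketched like this (a bash-only sketch with hypothetical paths; in the real script the `echo` line would be the backgrounded `srun` call). A `for` loop over an array never reads stdin, so a stdin-hungry child cannot steal the remaining file names:

```shell
#!/bin/bash
# Sketch of the array workaround: slurp the sample list into memory first,
# then iterate over the array instead of reading stdin line by line.
printf '/path/to/sample1\n/path/to/sample2\n/path/to/sample3\n' > samples.txt

mapfile -t samples < samples.txt   # one array element per line (bash builtin)
launched=0
for F in "${samples[@]}"; do
    fn=$(basename "$F")            # strip the directory part, as in the question
    # Real script would run:  srun -N 1 -n 1 -c 1 ... > "$OUT_DIR/$fn.out" &
    echo "would launch job for $fn"
    launched=$((launched + 1))
done
echo "jobs prepared: $launched"
rm -f samples.txt
```

Both fixes address the same root cause; `--input none` is the smaller change, while the array version also avoids any other stdin-reading command you might later add inside the loop.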