
I am having trouble launching a SLURM job that calls an mpirun subprocess from a Python script. Inside the Python script (let's call it script.py) I have this subprocess.run call:

    import subprocess

    def run_mpi(config_name, np, working_dir):
        data_path = working_dir + "/" + config_name
        # Build the full command as a single string; shell=True is needed
        # so the shell interprets the "<" stdin redirection, e.g.:
        #   mpirun -np 32 /$PATH/spk_mpi -echo log < /$PATH/in.potts
        cmd = (
            "mpirun -np " + str(np) + " "
            + working_dir + "/spk_mpi -echo log < "
            + data_path + "/in.potts"
        )
        subprocess.run(
            cmd,
            check=True,
            stderr=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            shell=True,
        )
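For completeness, script.py parses the command-line flags and calls run_mpi roughly like this (a sketch; the actual argument parsing isn't shown above, so the flag handling and the config_name default are assumptions):

    import argparse

    # Hypothetical sketch of the CLI handling in script.py; only
    # --working_dir and --np appear in the batch script below.
    parser = argparse.ArgumentParser()
    parser.add_argument("--working_dir", required=True)
    parser.add_argument("--np", required=True)  # arrives as a string
    parser.add_argument("--config_name", default="run1")  # assumed name
    args = parser.parse_args()

    run_mpi(args.config_name, args.np, args.working_dir)

Note that argparse hands --np over as a string, which is why run_mpi wraps it in str() defensively before building the command.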

I then execute the script by submitting a SLURM job to a cluster node with something like:

#!/bin/bash
#SBATCH --job-name=myjob            
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=2-00:00:00               # Time limit days-hrs:min:sec
#SBATCH --partition=thin

python script.py --working_dir=$PATH --np=$SLURM_NTASKS

but somehow the subprocess is never executed. I also tried changing the call to shell=False, but then I get "returned non-zero exit status 1" (I might be doing something wrong while building the argument list).
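For reference, a shell=False variant has to handle the "<" redirection explicitly, since without a shell nothing interprets it. A minimal sketch of what that call should look like, inside run_mpi and reusing its variables:

    # Sketch of the shell=False form: the command becomes an argument
    # list, and the "<" redirection becomes an explicit stdin file.
    with open(data_path + "/in.potts") as infile:
        subprocess.run(
            ["mpirun", "-np", str(np), working_dir + "/spk_mpi", "-echo", "log"],
            stdin=infile,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )

Splitting the original string naively with "<" left in the argument list would hand "<" and the path to mpirun as literal arguments, which is one plausible way to end up with a non-zero exit status.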

Note that if I don't submit the script as a batch job I am able to execute the subprocess; this only happens with the batch job. If I first allocate resources with salloc and then run the job interactively, I don't run into this issue either.

I'm not 100% sure, but it might be that the spawned subprocess doesn't get the SLURM configuration variables passed properly, so it doesn't know over which nodes to parallelize.
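A quick way to test that hypothesis is to dump the SLURM-related environment as seen from inside a spawned subprocess (a minimal sketch; note that subprocess.run inherits the parent's full environment by default, unless env= overrides it):

    import os
    import subprocess

    # Print every SLURM_* variable visible to a spawned shell, to check
    # whether the SLURM configuration actually reaches the subprocess.
    result = subprocess.run(
        "env | grep ^SLURM_",
        shell=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    print("child sees:", result.stdout)

    # Compare with what the Python script itself sees:
    print("parent sees:", [k for k in os.environ if k.startswith("SLURM_")])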

Any hint on how to fix this?

UPDATE: I could fix it by calling mpirun directly from the batch file. Since its input changes according to a path indicated in a config file, I solved it by reading that file line by line in the shell:

#SBATCH --ntasks=32

while IFS= read -r line; do
    # Trim leading/trailing whitespace from the line read from the config file
    path="$(echo -e "${line}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"

    echo "Processing: $path/in.potts_am_IN100_3d"
    mpirun -np 32 "${SPPARKS}/spk_mpi" -echo log < "${SPPARKS}/${path}/in.potts"

done < "$config_file"
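(${SPPARKS} and $config_file are presumably set earlier in the batch script; the hard-coded -np 32 could equally be $SLURM_NTASKS so it stays in sync with the #SBATCH --ntasks line.)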
Betelgeuse
  • 1. Are you sure the subprocess is not executed? Have you tried a hello-world program? 2. Shell input with "<" to an MPI process is dangerous, certainly if that MPI process is not called directly. Specify a file with your input data and read that. 3. Why not run the Python process in parallel, and do the sequential parts only on process zero? – Victor Eijkhout May 26 '23 at 13:50
  • The mpirun call is supposed to generate some files. When I submit the script via batch job, the script is executed but gets stuck in the subprocess call (the job has been running for 16 hours by now and I don't see any output file). – Betelgeuse May 26 '23 at 13:51
  • I understand what is supposed to happen. Since it doesn't, you should debug it. I have given you three suggestions. – Victor Eijkhout May 26 '23 at 15:50

0 Answers