
Background:
I wrote a Python script to convert files from one format to another. The script takes a text file (subject_list.txt) as input and iterates through the source directory names listed there (several hundred directories, each with thousands of files), converting their contents and storing the results in a specified output directory.

Issue:
To save time, I would like to run this script on a high-performance cluster (HPC) and create jobs that convert the files in parallel, rather than sequentially iterating through each directory in the list.

I am new both to Python and to HPCs. Our lab previously wrote primarily in Bash and did not have access to an HPC environment, but we recently gained access to one and the decision has been made to switch to Python, so everything is pretty new.

Question:
Is there a Python module that will allow me to create jobs from within the script? I've found documentation for the multiprocessing and subprocess modules, but it isn't clear to me how I would use them. Or is there a different approach I should take? I've also read a number of posts here on Stack Overflow about using Slurm and Python together, but I'm stymied: too much information and not enough knowledge to tell which thread to pick up. Any help is greatly appreciated.

Environment:
HPC: Red Hat Enterprise Linux Server release 7.4 (Maipo)
python3/3.6.1
slurm 17.11.2

Housekeeping part of the code:

import os
import subprocess

# Change this for your study
group="labname"
study="studyname"

# Set paths
archivedir="/projects/" + group + "/shared/" + study + "/archive"
sourcedir="/projects/" + group + "/shared/DICOMS/" + study
niidir=archivedir + "/clean_niftis"
outputlog=niidir + "/outputlog_convert.txt"
errorlog=niidir + "/errorlog_convert.txt"
dcm2niix="/projects/" + group + "/shared/dcm2niix/build/bin/dcm2niix"

# Source the subject list (needs to be in your current working directory)
subjectlist="subject_list.txt" 

# Check/create the log files
def touch(path): # helper: create the file if it doesn't exist
    with open(path, 'a'): # open in append mode so existing content is kept
        os.utime(path, None) # update the access/modification times

if not os.path.isfile(outputlog): # if the file does not exist...
    touch(outputlog)
if not os.path.isfile(errorlog):
    touch(errorlog)

Part I'm stuck at:

with open(subjectlist) as file:
    lines = file.readlines() 

for line in lines:
    subject=line.strip()
    subjectpath=sourcedir+"/"+subject
    if os.path.isdir(subjectpath):
        with open(outputlog, 'a') as logfile:
            logfile.write(subject+os.linesep)

        # Submit a job to the HPC with sbatch. This next line was not in the 
        # original script that works, and it isn't correct, but it captures
        # the gist of what I am trying to do (written in bash).
        sbatch --job-name dcm2nii_"${subject}" --partition=short --time 00:60:00 --mem-per-cpu=2G --cpus-per-task=1 -o "${niidir}"/"${subject}"_dcm2nii_output.txt -e "${niidir}"/"${subject}"_dcm2nii_error.txt 

        # This is what I want the job to do for the files in each directory:
        subprocess.call([dcm2niix, "-b", "y", "-o", niidir, subjectpath]) # one token per list element; -o takes the output directory next

    else:
        with open(errorlog, 'a') as logfile:
            logfile.write(subject+os.linesep)

Edit 1:
dcm2niix is the software used for the conversion and is available on the HPC. It takes the following flags and paths: `-b y -o outputDirectory sourceDirectory`.
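A quick way to sanity-check that call for a single subject before wrapping it in sbatch is to run it through subprocess directly. This is an untested sketch that assumes the dcm2niix, niidir, and subjectpath variables defined above:

import subprocess

# Untested sketch: convert one subject directly and surface any error
# output, to confirm the flags before submitting hundreds of jobs.
result = subprocess.run([dcm2niix, "-b", "y", "-o", niidir, subjectpath],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        universal_newlines=True)
print(result.returncode, result.stdout)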

Edit 2 (solution):

with open(subjectlist) as file:
    lines = file.readlines() # read all subject IDs, one per line
for line in lines:
    subject=line.strip()
    subjectpath=sourcedir+"/"+subject
    if os.path.isdir(subjectpath):
        with open(outputlog, 'a') as logfile:
            logfile.write(subject+os.linesep)
        # Create a job to submit to the HPC with sbatch 
        batch_cmd = ('sbatch --job-name dcm2nii_{subject} --partition=short --time 00:60:00 '
                     '--mem-per-cpu=2G --cpus-per-task=1 '
                     '-o {niidir}/{subject}_dcm2nii_output.txt '
                     '-e {niidir}/{subject}_dcm2nii_error.txt '
                     '--wrap="/projects/{group}/shared/dcm2niix/build/bin/dcm2niix '
                     '-o {niidir} {subjectpath}"').format(
                         subject=subject, niidir=niidir, subjectpath=subjectpath, group=group)
        # Submit the job
        subprocess.call([batch_cmd], shell=True)
    else:
        with open(errorlog, 'a') as logfile:
            logfile.write(subject+os.linesep)
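Note that passing the whole command as one string requires shell=True. An untested variant that avoids the shell is to tokenize the string with shlex; the quoted --wrap argument stays together as a single token:

import shlex

# Untested variant: split the command string into an argument list
# so that shell=True is no longer needed.
subprocess.call(shlex.split(batch_cmd))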
kdestasio

1 Answer


This is a possible solution for your code. It has not been tested.

with open(subjectlist) as file:
    lines = file.readlines() 

for line in lines:
    subject=line.strip()
    subjectpath=sourcedir+"/"+subject
    if os.path.isdir(subjectpath):
        with open(outputlog, 'a') as logfile:
            logfile.write(subject+os.linesep)

        # Build the sbatch command that wraps the dcm2niix call for this
        # subject. sbatch's --wrap option takes the command the job should
        # run as a quoted string.
        cmd = 'sbatch --job-name dcm2nii_{subject} --partition=short --time 00:60:00\
        --mem-per-cpu=2G --cpus-per-task=1 -o {niidir}/{subject}_dcm2nii_output.txt\
        -e {niidir}/{subject}_dcm2nii_error.txt\
        --wrap="dcm2niix -b y -o {niidir} {subjectpath}"'.format(subject=subject, niidir=niidir, subjectpath=subjectpath)

        # Submit the job
        subprocess.call([cmd], shell=True)

    else:
        with open(errorlog, 'a') as logfile:
            logfile.write(subject+os.linesep)
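If you also want to record which job each subject was assigned, sbatch prints the job id on standard output (normally a line like "Submitted batch job 123456"). An untested sketch that captures it:

# Untested sketch: capture sbatch's stdout so the job id can be
# logged alongside the subject name.
result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                        universal_newlines=True)
with open(outputlog, 'a') as logfile:
    logfile.write(subject + ": " + result.stdout)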
Carles Fenoy
  • This results in: Traceback (most recent call last): File "dcm2nii_batch.py", line 50, in subprocess.call([cmd]) File "/packages/python/3.6.1/lib/python3.6/subprocess.py", line 267, in call with Popen(*popenargs, **kwargs) as p: File "/packages/python/3.6.1/lib/python3.6/subprocess.py", line 707, in __init__ restore_signals, start_new_session) File "/packages/python/3.6.1/lib/python3.6/subprocess.py", line 1326, in _execute_child raise child_exception_type(errno_num, err_msg) FileNotFoundError: [Errno 2] No such file or directory: `content of cmd` – kdestasio Jan 26 '18 at 04:09
  • When I type the equivalent command in the terminal on the HPC, a job is created successfully: `sbatch --job-name test_job --partition=short --time 00:60:00 --mem-per-cpu=2G --cpus-per-task=1 -o /projects/sanlab/shared/REV/REV_scripts/org/python/output.txt -e /projects/sanlab/shared/REV/REV_scripts/org/python/error.txt --wrap="/projects/sanlab/shared/dcm2niix/build/bin/dcm2niix -o -b y /projects/sanlab/shared/REV/archive/clean_nii /projects/sanlab/shared/DICOMS/REV/subject_directory"` – kdestasio Jan 26 '18 at 05:04
  • Thank you @Carles Fenoy for your help with this! I've edited my original post to include the working code. The missing piece was to set shell=True in `subprocess.call([cmd], shell=True)`, which I tried after reading this post: http://www.sharats.me/posts/the-ever-useful-and-neat-subprocess-module/ – kdestasio Jan 26 '18 at 18:46
  • @kdestasio, thanks for the fix. I edited the answer with the proper code – Carles Fenoy Jan 27 '18 at 10:38