
I have a piece of Python code that runs on a single machine, using multiprocessing.Pool for lots of independent jobs. I wonder if it's possible to make it even more parallel on an SGE grid, e.g., with each node of the grid running multiple processes for these independent jobs.

Originally, I have:

import functools
import multiprocessing

# function def:
#   some_function(file_list, param1, param2, param3, process_index)
func = functools.partial(some_function, file_list, param1, param2, param3)
pool = multiprocessing.Pool(processes=some_integer)
ret_list = pool.map(func, range(some_integer))
pool.close()

It seems to work fine on a local machine, but when submitted to an SGE grid as is, it quits abnormally without printing an error message. The submission command may look like this:

qsub -V -b yes -cwd -l h_vmem=10G -N jobname -o grid_job.log -j yes "python worker.py"

Ideally, I'm looking for minimal changes to the local version of the Python code so that it can run on the SGE grid, because it's hard to install new tools on the grid or change any grid configuration without affecting other users.

At a minimum, I understand it's possible to just rewrite the code so that the processing of each job (a file in file_list) is handled by one qsub command. But I wonder what the best practice is.
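For reference, the per-file version I have in mind would just be a small wrapper that submits one qsub per file (an untested sketch; worker_single.py is a placeholder for a variant of my script that processes a single file):

#!/bin/bash
# One qsub per input file; worker_single.py is hypothetical and would
# take a single file path instead of an index into file_list.
for f in "$@"; do
    qsub -V -b yes -cwd -l h_vmem=10G -N "job_$(basename "$f")" -o grid_job.log -j yes "python worker_single.py '$f'"
done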


1 Answer

What I would do is make the Python script read the file list and the number of processes as command-line arguments; that way it's easier to call. I would then write a Bash script that receives the file list as arguments and submits the jobs, depending on what you want to do. This gives you two levels of parallelization: across several nodes (qsub) and across several processes per node (Python multiprocessing). To do it the right way, you need to tell qsub the number of slots you want for each job. This is done by submitting into a parallel environment and specifying a slot count (-pe ENV_NAME NBSLOTS):

#!/bin/bash

# Reserve as many slots per job as the Python script will spawn processes.
NB_PROCESS_PER_JOB=2
# Number of input files handled by each grid job.
NB_FILE_PER_JOB=3
CPT=0       # files accumulated for the current job
BUF=""      # quoted file arguments for the current job
NUMJOB=1    # running job index, used to build distinct job names

for i in "$@"; do
    BUF="$BUF '$i'"
    ((CPT++))
    if ((CPT == NB_FILE_PER_JOB)); then
        # "multithread" must be an existing parallel environment on your cluster;
        # drop the leading echo to actually submit instead of printing the command.
        echo qsub -pe multithread $NB_PROCESS_PER_JOB -V -b yes -cwd -l h_vmem=10G -N jobname$NUMJOB -o grid_job.log -j yes "python worker.py $NB_PROCESS_PER_JOB $BUF"
        BUF=""
        CPT=0
        ((NUMJOB++))
    fi
done
# Submit whatever files are left over (an incomplete last batch).
if [[ "$BUF" != "" ]]; then
    echo qsub -pe multithread $NB_PROCESS_PER_JOB -V -b yes -cwd -l h_vmem=10G -N jobname$NUMJOB -o grid_job.log -j yes "python worker.py $NB_PROCESS_PER_JOB $BUF"
fi
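For example, if the script above is saved as submit_jobs.sh (the name and the file paths below are just placeholders), you call it with the whole file list and it prints one qsub command per batch of three files:

chmod +x submit_jobs.sh
./submit_jobs.sh data/*.txt    # with 7 files: three batches of 3, 3 and 1 files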

The Python script would look like this:

import sys
import multiprocessing

# Usage: python worker.py <nb_processes> <file1> <file2> ...
nb_processes = int(sys.argv[1])
file_list = sys.argv[2:]

# some_function processes a single file and returns its result.
pool = multiprocessing.Pool(processes=nb_processes)
ret_list = pool.map(some_function, file_list)
pool.close()

If your SGE cluster does not have any parallel environment, I suggest you do not parallelize the Python script (remove the -pe ENV_NAME NBSLOTS argument and either do not use a pool in the Python script or make it spawn only one process). A simple SGE job is not supposed to be multithreaded: if it is, it uses unreserved resources and might slow down other users' jobs.
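For example, without a parallel environment a submission could keep the same worker.py but ask it for a single process (the file names below are just placeholders):

# One slot, one Python process: no -pe option and nb_processes set to 1.
qsub -V -b yes -cwd -l h_vmem=10G -N jobname1 -o grid_job.log -j yes "python worker.py 1 'file1.txt' 'file2.txt' 'file3.txt'"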

  • Thanks for the helpful suggestions! Yes, I agree that it's probably better just to keep SGE jobs simple. If I wanted more parallelism, I could divide the file list into more chunks and submit them to more grid nodes. This way it's more friendly to the SGE scheduler, I think. – galactica Mar 30 '17 at 21:15
  • I don't know if it's more friendly to the scheduler ;-) but it does seem more friendly to the other users of the SGE cluster, because many small jobs make resource sharing easier. Moreover, you have a better chance of repeatedly getting a single slot by submitting several simple jobs than by submitting multithreaded jobs when the cluster is almost full, because a job that asks for several slots has to get them all on the same node (MPI is another story...). You are lucky that your problem is easily parallelizable over the data, so it does not require extra work to parallelize. – Julien V Mar 30 '17 at 21:41