I am currently using a cluster that runs SGE. There, I submit a .sh script, which calls a Python script (parallelized using multiprocessing.Pool), to a parallel queue by calling qsub run.sh. The Python script itself prints some kind of progress via print(...), which then appears in the output file created by SGE. Now there is a huge problem: when I execute the script manually, everything works like a charm, but when I use the parallel queue, at some (random) iteration the pool workers seem to stop working, as no further progress can be seen in the output file. Furthermore, the CPU usage suddenly drops to 0% and all threads of the script are just idling.
What can I do to solve this problem? Or how can I even debug it? As there are no error messages in the output file, I am really confused.
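One idea I have for debugging (just a sketch of extra instrumentation I could add to mainscript.py, not something that is in the job yet; the signal, the timeout and the use of stderr are my own assumptions) is to force unbuffered progress output and to register faulthandler, so that I can dump the stack traces of all threads once the job appears to hang:

import faulthandler
import signal
import sys

# dump the tracebacks of all threads when the process receives SIGUSR1,
# e.g. after running `kill -USR1 <pid>` on the compute node
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# additionally dump all tracebacks every 10 minutes (repeat=True),
# so a silent hang still leaves a trace in the error output
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# make sure progress output is not stuck in a buffer when stdout is redirected by SGE
print("progress ...", flush=True)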
Edit: Here are the relevant parts of the shell script that is submitted to the queue, together with the necessary Python files.
main.sh:
#!/bin/bash
# Use bash as shell
#$ -S /bin/bash
# Preserve environment variables
#$ -V
# Execute from current working directory
#$ -cwd
# Merge standard output and standard error into one file
#$ -j yes
# Standard name of the job (if none is given on the command line):
#$ -N vh_esn_gs
# Path for the output files
#$ -o /home/<username>/q-out/
# Limit memory usage
#$ -hard -l h_vmem=62G
# array range
#$ -t 1-2
# parallel
#$ -pe <qname> 16
#$ -q <qname>
python mainscript.py
mainscript.py:
# read parameters etc. [...]

def mainFunction():
    worker = ClassWorker(...)
    worker.startparallel()

if __name__ == '__main__':
    mainFunction()
whereby the ClassWorker class is defined like this:
from multiprocessing import Pool, Queue

class ClassWorker:
    @staticmethod
    def _get_score(data):
        params, fixed_params, trainingInput, trainingOutput, testingDataSequence, esnType = data
        [... (the calculation is performed)]
        dat = (test_mse, training_acc, params)
        # push the result into the shared queue so the progress watcher can pick it up
        ClassWorker._get_score.q.put(dat)
        return dat

    @staticmethod
    def _get_score_init(q):
        # runs once in every pool worker: attach the shared queue to the worker function
        ClassWorker._get_score.q = q

    def startparallel(self):
        queue = Queue()
        pool = Pool(processes=n_jobs, initializer=ClassWorker._get_score_init, initargs=[queue])
        [... (setup jobs)]
        [start async thread to watch for incoming results in the queue to update the progress]
        results = pool.map(ClassWorker._get_score, jobs)
        pool.close()
Maybe this helps to spot the problem. I did not include the real calculation part, as it has not caused any trouble on the cluster so far, so this should be safe.
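For reference, here is a stripped-down, self-contained sketch of the same Queue/Pool initializer pattern with a dummy calculation in place of my real one (the worker count and the fake data are of course made up), in case someone wants to try reproducing the hang:

from multiprocessing import Pool, Queue
import threading

def _work(data):
    # dummy calculation standing in for the real one
    result = data * data
    _work.q.put(result)
    return result

def _work_init(q):
    # attach the shared queue to the worker function in every pool process
    _work.q = q

def _watch(q, total):
    # progress watcher thread, analogous to the async thread in startparallel
    for i in range(total):
        q.get()
        print("progress: {}/{}".format(i + 1, total), flush=True)

if __name__ == '__main__':
    jobs = list(range(100))
    queue = Queue()
    watcher = threading.Thread(target=_watch, args=(queue, len(jobs)), daemon=True)
    watcher.start()
    # passing the Queue via initargs relies on the default fork start method on Linux
    pool = Pool(processes=16, initializer=_work_init, initargs=[queue])
    results = pool.map(_work, jobs)
    pool.close()
    pool.join()
    print("done:", len(results), flush=True)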