
I am currently using a cluster which runs SGE (Sun Grid Engine). There, I submit a .sh script that calls a Python script (parallelized via multiprocessing.Pool) to a parallel queue by calling qsub main.sh. The Python script itself prints some kind of progress via print(...), which then appears in the output file created by SGE. Now there is a big problem: when I execute the script manually, everything works like a charm, but when I use the parallel queue, at some (random) iteration the pool workers seem to stop working, as no further progress can be seen in the output file. Furthermore, the CPU usage suddenly drops to 0% and all threads of the script are just idling.
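To make sure the missing progress lines are not just stdout buffering (Python block-buffers print output when stdout is redirected to a file, as SGE does), the progress prints can be flushed explicitly - although buffering alone would not explain the CPU usage dropping to 0%. A minimal example (flush= assumes Python 3; alternatively the script can be started with python -u):

# example progress line; flush=True forces it into the SGE output file immediately
print('iteration finished', flush=True)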

What can I do to solve this problem? Or how can I even debug it? As there are no error messages in the output file, I am really confused.
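So far the only debugging idea I have is to make the hanging processes dump their tracebacks on demand. A minimal sketch with faulthandler (SIGUSR1 is just an arbitrary choice; the dump goes to stderr, which -j yes below merges into the SGE output file):

import faulthandler
import signal

# put this at the top of mainscript.py; pool workers forked later should
# inherit the handler, so `kill -USR1 <pid>` on the compute node dumps the
# tracebacks of all threads of that process without killing the job
faulthandler.register(signal.SIGUSR1, all_threads=True)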

Edit: Here are the relevant parts of the shell script that is submitted to the queue, together with the necessary Python files.

main.sh:

#!/bin/bash

# Use bash as the shell
#$ -S /bin/bash

# Preserve environment variables
#$ -V

# Execute from current working directory
#$ -cwd

# Merge standard output and standard error into one file
#$ -j yes

# Standard name of the job (if none is given on the command line):
#$ -N vh_esn_gs

# Path for the output files
#$ -o /home/<username>/q-out/

# Limit memory usage
#$ -hard -l h_vmem=62G

# array range
#$ -t 1-2

# parallel
#$ -pe <qname> 16

#$ -q <qname>

python mainscript.py

mainscript.py:

# read parameters etc. [...]

def mainFunction():
    worker = ClassWorker(...)
    worker.startparallel()

if __name__ == '__main__':
    mainFunction()

where ClassWorker is defined like this:

from multiprocessing import Pool, Queue

class ClassWorker:
    @staticmethod
    def _get_score(data):
        params, fixed_params, trainingInput, trainingOutput, testingDataSequence, esnType = data
        [... (the calculation is performed)]
        dat = (test_mse, training_acc, params)
        ClassWorker._get_score.q.put(dat)

        return dat

    @staticmethod
    def _get_score_init(q):
        ClassWorker._get_score.q = q

    def startparallel(self):
        queue = Queue()
        pool = Pool(processes=n_jobs, initializer=ClassWorker._get_score_init, initargs=(queue,))
        [... (set up the jobs)]
        [... (start an async thread that watches the queue for incoming results and updates the progress)]

        results = pool.map(ClassWorker._get_score, jobs)
        pool.close()
Maybe this helps to spot the problem. I did not include the real calculation part, as it has not caused any trouble on the cluster so far, so it should be safe.
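For reference, here is a stripped-down, runnable version of the same pattern with a dummy calculation instead of the real one (n_jobs hard-coded to 4 and the result watcher reduced to a plain thread; the structure matches the code above):

from multiprocessing import Pool, Queue
import threading

class ClassWorker:
    @staticmethod
    def _get_score(data):
        dat = (data * data, data)           # dummy stand-in for (test_mse, training_acc, params)
        ClassWorker._get_score.q.put(dat)   # report the result to the watcher thread
        return dat

    @staticmethod
    def _get_score_init(q):
        # runs once in every pool worker and makes the queue reachable from _get_score
        ClassWorker._get_score.q = q

    def startparallel(self):
        queue = Queue()
        pool = Pool(processes=4, initializer=ClassWorker._get_score_init, initargs=(queue,))
        jobs = list(range(20))

        # watcher thread: prints progress as the results arrive in the queue
        def watch():
            for i in range(len(jobs)):
                print('progress: %d/%d -> %r' % (i + 1, len(jobs), queue.get()), flush=True)
        watcher = threading.Thread(target=watch)
        watcher.start()

        results = pool.map(ClassWorker._get_score, jobs)
        watcher.join()   # every job has put exactly one item into the queue
        pool.close()
        pool.join()
        return results

if __name__ == '__main__':
    print(ClassWorker().startparallel())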

zimmerrol
  • As SGE runs your script on another machine, you should first check the differences in environment between the machines. Some of those differences may trigger a deadlock. Without seeing the code, we can't really help. – gdlmx May 04 '17 at 15:47
  • @gdlmx I have tested the code on the machine on which the SGE will execute the code. There this "deadlock" behaviour did not occur and the code ran smoothly. – zimmerrol May 04 '17 at 17:38
  • @gdlmx I added parts of my code - maybe this helps! – zimmerrol May 04 '17 at 17:47
  • As you used the SGE *array job* option `#$ -t 1-2`, two instances of the script (`python mainscript.py`) will be run *simultaneously* under the same working directory. If both of them try to access a common resource (e.g. an output file), a deadlock may occur. Did you consider this problem? To test it, you should run two instances of your Python script at the same time. – gdlmx May 08 '17 at 11:07
  • Another SGE option, `-pe <qname> 16`, may also cause the problem. Apparently you should not use `<qname>` here. The correct usage is `-pe <pe_name> <slots>`, where `pe_name` is the "name of the parallel environment as defined for pe_name in sge_types(1)". – gdlmx May 08 '17 at 11:15
  • I assume that by simply deleting both the `-t` and `-pe` options, your problem will be solved. SGE has no idea how the Python multiprocessing pool handles its subprocesses, so you shouldn't play with those options unless you understand the "SGE parallel environment configuration" file completely. – gdlmx May 08 '17 at 11:25
  • @gdlmx I used `-t 1-2` just to test the script - I would like to run it with a much broader range later. The only I/O that happens is the printing of the progress, and at the beginning the script reads one file from the HDD - so there should not be a deadlock... – zimmerrol May 09 '17 at 17:24
  • Sorry, I mixed the variables in the replacement up. `<qname>` should be `<pe_name>`. – zimmerrol May 09 '17 at 17:25
  • I also think it might be related to a potential deadlock because of `-t 1-2`. Also, have you checked that your Python script does not exceed any limit, like RAM or execution time? And why don't you specify explicitly how many processes multiprocessing will start? – Julien V Sep 17 '17 at 17:20

0 Answers