I have access to a large compute cluster that uses MPI and Slurm. I got a parallel IPython cluster (ipcluster) to run through the Slurm job submission system.
I can start it with, for example:
ipcluster start -n 320 --profile=slurm
It then submits two jobs, one to start the controller and one to start the engines:
2016-07-12 11:41:37.055 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-07-12 11:41:37.057 [IPClusterStart] Creating pid file: /draco/u/USERNAME/.ipython/profile_slurm/pid/ipcluster.pid
2016-07-12 11:41:37.057 [IPClusterStart] Starting Controller with SlurmControllerLauncher
2016-07-12 11:41:37.079 [IPClusterStart] Job submitted with job id: u'9908'
2016-07-12 11:41:38.080 [IPClusterStart] Starting 320 Engines with SlurmEngineSetLauncher
2016-07-12 11:41:38.103 [IPClusterStart] Job submitted with job id: u'9909'
2016-07-12 11:42:08.129 [IPClusterStart] Engines appear to have started successfully
My first issue is the message "Engines appear to have started successfully", which is not necessarily true: the job that starts the engines often has to wait in the queue for some time before it can run, because it requests far more resources than the controller job.
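To illustrate what I mean by the engines not really being up yet, here is a minimal sketch of how a client script could wait until the engines have actually registered instead of trusting that log line. The helper name wait_for_engines and its parameters are my own and not part of ipcluster; it only assumes some callable that returns the current number of engines.

```python
import time


def wait_for_engines(count_engines, expected, timeout=3600.0, interval=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `expected` engines are reported.

    `count_engines` is any zero-argument callable returning the number of
    engines currently registered (hypothetical hook, see the note below).
    Raises TimeoutError if the engines do not show up within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        n = count_engines()
        if n >= expected:
            return n
        if clock() >= deadline:
            raise TimeoutError(
                "only %d of %d engines registered after %.0f s"
                % (n, expected, timeout)
            )
        # Engines may still be sitting in the Slurm queue; wait and retry.
        sleep(interval)
```

With ipyparallel, the callable could be something like lambda: len(rc.ids), where rc = ipyparallel.Client(profile="slurm"). This only works around the misleading message; it does not help with the walltime mismatch described next.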
This brings me to my actual issue. Suppose I request 2 hours. The single-core controller job starts immediately, but the engine job waits, say, 1 hour in the queue. After the engines have been computing for 1 hour, the controller job reaches its 2-hour limit and is killed, while the engines stay alive without a controller.
Is there a way to do all of this in a single job, in which one process is the controller and the remaining processes are the engines? That way they would all start at roughly the same time.
I know I could just request much more time for the controller job, but that doesn't seem like a clean solution to me.
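To make the single-job idea concrete, here is a rough sketch of the kind of batch script I have in mind, generated from Python so that the engine count stays a parameter. The function name render_single_job_script, the 30-second pause, and the choice of one extra task for the controller are my assumptions; I have not verified that this layout works on my cluster.

```python
def render_single_job_script(n_engines, walltime, profile):
    """Return the text of one Slurm batch script that starts the controller
    and all engines inside the same allocation (illustrative sketch only)."""
    lines = [
        "#!/bin/bash",
        # One task per engine plus one for the controller (assumption).
        "#SBATCH --ntasks=%d" % (n_engines + 1),
        "#SBATCH --time=%s" % walltime,
        "",
        # Start the controller in the background on the first node.
        "ipcontroller --profile=%s --ip='*' &" % profile,
        # Give the controller time to write its connection files
        # (the 30 s value is a guess).
        "sleep 30",
        # Launch the engines; the job ends when this srun step ends.
        "srun --ntasks=%d ipengine --profile=%s" % (n_engines, profile),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_single_job_script(320, "02:00:00", "slurm"))
```

Because the controller and the engines share one allocation and one time limit, neither can outlive the other. What I do not know is how to get ipcluster itself to submit something like this.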
EDIT:
I just stumbled over this GitHub issue, which describes a solution that puts both in one submit script. Can that still be done with ipcluster, by somehow telling it to just run this script? It is a bit of overhead, but it would be nice if the syntax were always
ipcluster start -n N --profile=whatever