I am running a 30-core Python job on a cluster using MPIPool. When I delete the job with the usual qdel <job ID> command, only the parent process is killed, while the child processes keep running. In other words: qdel removes the job ID from the queue, but the 30 Python processes (one per core) that were spawned remain in the background, contributing heavily to the cluster load. Moreover, I can only kill those background processes manually, and only on the one node I am logged into.
Another complication is that my Python script calls a piece of Fortran code (wrapped with the f2py module). I have noticed in the past, when running the programme locally, that the Fortran code does not respond to a Ctrl+C interrupt; the programme is only aborted once control returns to the Python layer.
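My understanding, and this is an assumption on my part, is that CPython only runs signal handlers between bytecode instructions, so nothing can interrupt the interpreter while it is inside a compiled routine:

import signal

# Sketch: this handler cannot fire while execution is inside the
# compiled Fortran routine; it only runs once the f2py call has
# returned control to the Python layer.
def _handler(signum, frame):
    raise KeyboardInterrupt

signal.signal(signal.SIGTERM, _handler)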
I have consulted the documentation of MPIPool, which I use to parallelise the job, but I could not pinpoint where exactly things go wrong. Ideally, I would like each child process to check on its parent regularly and to terminate itself when it notices that the parent is gone. At the moment, deleting the job seems to simply cut the rope that ties parent and child together, without deleting the child.
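What I picture is something along these lines; this is purely a hypothetical sketch (not existing MPIPool functionality), relying on the POSIX behaviour that an orphaned process is reparented, typically to PID 1:

import os
import threading
import time

def _exit_when_orphaned(poll_interval=5.0):
    # Remember the original parent; once os.getppid() changes, the
    # parent has died and this worker terminates itself.
    parent = os.getppid()
    while os.getppid() == parent:
        time.sleep(poll_interval)
    os._exit(1)

# Each worker would start this as a daemon thread after the pool is set up:
threading.Thread(target=_exit_when_orphaned, daemon=True).start()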
The snippet below shows how the pool object is integrated into my main code. In addition, I use a bash script to submit the job to the cluster queue (it contains echo 'mpirun -np '$NCORES' python '$SKRIPTNAME >> $TMPFILE) and to request the number of cores I want to use; a sketch of the resulting job file follows the snippet. The submission itself seems to work fine.
import sys

import emcee
from emcee.utils import MPIPool

pool = MPIPool()
if not pool.is_master():
    # Worker processes block here, waiting for tasks from the master.
    pool.wait()
    sys.exit(0)

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
pos, prob, state = sampler.run_mcmc(p0, 1000)  # p0 contains the initial walker positions
pool.close()
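For completeness, the job file generated by the bash script boils down to something like this:

# Sketch of the generated job file. The PBS directives below are
# placeholders for my actual resource request; the mpirun line is the
# one written by the echo command above (with $NCORES and $SKRIPTNAME
# expanded to their actual values).
#PBS -N emcee_job
#PBS -l nodes=2:ppn=15
cd $PBS_O_WORKDIR
mpirun -np $NCORES python $SKRIPTNAME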
Background: I use the emcee module to carry out a Monte Carlo simulation. lnprob is a likelihood function that is evaluated for the parameter set drawn in a given iteration; lnprob calls a Fortran routine that handles the computationally expensive parts.
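Conceptually, that call into Fortran looks like this (module and function names are hypothetical placeholders for my actual f2py-compiled extension):

import fortran_likelihood  # hypothetical f2py-built extension module

def lnprob(theta):
    # All expensive work happens inside compiled Fortran, where Python
    # cannot deliver signals until the call returns.
    return fortran_likelihood.log_like(*theta)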
Edit: Please find below a minimal script for which the issue still occurs. I have been able to verify that f2py is apparently not causing the problem:
import numpy as np
import sys

import emcee
from emcee.utils import MPIPool

def calc_log_prob(a, b, c, d):
    # Dummy workload that keeps each core busy for a while.
    for i in np.arange(1000):
        for j in np.arange(1000):
            for k in np.arange(1000):
                for g in np.arange(1000):
                    x = i + j + k + g
    return -np.abs(a + b)

def lnprob(x):
    return calc_log_prob(*x)

ndim, nwalkers = 4, 180
p0 = [np.array([np.random.normal(loc=-5.5, scale=2., size=1)[0],
                np.random.normal(loc=-0.3, scale=1., size=1)[0],
                0. + 3000.*np.random.uniform(size=1)[0],
                -6. + 3.*np.random.uniform(size=1)[0]]) for i in range(nwalkers)]

with MPIPool() as pool:
    if not pool.is_master():
        # Wait for instructions from the master process.
        pool.wait()
        sys.exit(0)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
    pos, prob, state = sampler.run_mcmc(p0, 560)
    pool.close()
This script closely follows the example outlined in the emcee documentation, with the pool incorporated as intended. To be honest, I am completely clueless as to where the source of this malfunction lies. I am almost inclined to say that the issue is cluster-related rather than a problem with the code itself.