I am trying to run a simple example with nested sampling and dynesty. I installed dynesty from GitHub: https://github.com/joshspeagle/dynesty
Computer Setup
OS: Mac OS X El Capitan (10.11.6)
CPU: 8 cores
RAM: 16.0 GB
gcc: 4.8.5 via conda install gcc
Problem Setup
I ran the code below (simulate data, set up the prior/likelihood, and submit the problem to dynesty).
To enable multiprocessing, I tried multiprocessing.Pool, concurrent.futures.ProcessPoolExecutor, and concurrent.futures.ThreadPoolExecutor (the two setups not shown in the snippet further down are sketched just below).
I ran the code in Jupyter Lab, in ipython, and as a script (python run_dynesty_test.py).
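For reference, the other pool setups I tried look roughly like this. This is a minimal sketch: it assumes the same loglike, prior, and ndims defined in the full script below, and relies on dynesty accepting any pool-like object with a map method via its pool/queue_size arguments.

import dynesty
from multiprocessing import Pool, cpu_count
from concurrent.futures import ProcessPoolExecutor

# variant 1: multiprocessing.Pool
with Pool(cpu_count() - 1) as pool:
    sampler = dynesty.DynamicNestedSampler(
        loglike, prior, ndim=ndims,
        bound='multi', sample='unif',
        pool=pool, queue_size=cpu_count())
    sampler.run_nested(nlive_init=100, nlive_batch=100)

# variant 2: concurrent.futures.ProcessPoolExecutor (also exposes .map)
with ProcessPoolExecutor(max_workers=cpu_count() - 1) as executor:
    sampler = dynesty.DynamicNestedSampler(
        loglike, prior, ndim=ndims,
        bound='multi', sample='unif',
        pool=executor, queue_size=cpu_count())
    sampler.run_nested(nlive_init=100, nlive_batch=100)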
Problem: The entire script runs fine, but dynesty/python starts out using all of the cores and then slowly uses fewer and fewer of them. After about 5 minutes, dynesty/python settles at almost exactly 1 core.
Evidence: htop starts reading 780%, then 550%, then 350%, then 100% CPU, and it stays at 100% CPU for the rest of the run, except that roughly once every other minute htop briefly reads ~250-300% CPU.
Question
Why is dynesty/ThreadPoolExecutor/python/etc. not using all of my cores all of the time?
Code Snippet Involving Dynesty and Multiprocessing
with ThreadPoolExecutor(max_workers=cpu_count()-1) as executor:
    sampler = dynesty.DynamicNestedSampler(
        loglike,
        prior,
        ndim=ndims,
        nparam=ndims,
        bound='multi',
        sample='unif',
        pool=executor,
        queue_size=cpu_count())
    sampler.run_nested(nlive_init=100, nlive_batch=100)
    res = sampler.results
Full Script to Set Up the Test
from __future__ import absolute_import, unicode_literals, print_function
from multiprocessing import set_start_method
set_start_method('forkserver')

import dynesty
import joblib
import math
import os
import threading, subprocess

import numpy as np
from sys import platform
from numpy import pi, sin, cos, linspace
from pylab import *  # ;ion()
from multiprocessing import Pool, cpu_count
if not os.path.exists("chains"): os.mkdir("chains")
# plotting
import matplotlib
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gaussian1Dp(cube):
    # 1D Gaussian parameterized by [center, width, height]
    center = cube[0]
    width = cube[1]
    height = cube[2]
    return lambda y: height * np.exp(-0.5 * ((center - y) / width)**2)

# alias used by the data simulation and the likelihood below
model = gaussian1Dp
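# quick illustrative sanity check: a Gaussian evaluated at its center
# returns its height
assert np.isclose(gaussian1Dp([0.0, 0.1, 0.8])(0.0), 0.8)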
np.random.seed(42)
param0a= -0.5
param0b= 0.5
param1a= 0.1
param1b= 0.1
param2a= 0.8
param2b= 0.8
yunc = 0.1
nPts = int(100)
nThPts= int(1e3)
xmin = -1
xmax = 1
dx = 0.1*(xmax - xmin)
yuncs = np.random.normal(yunc, 1e-2 * yunc, nPts)
thdata= np.linspace(xmin-dx, xmax+dx, nThPts)
xdata = np.linspace(xmin,xmax,nPts)
ydata = model([param0a,param1a,param2a])(xdata) \
+ model([param0b,param1b,param2b])(xdata)
yerr = np.random.normal(0, yuncs, nPts)
zdata = ydata + yerr
figure(figsize=(10,10))
plot(thdata, model([param0a,param1a,param2a])(thdata) \
+ model([param0b,param1b,param2b])(thdata))
errorbar(xdata, zdata, yunc*ones(zdata.size), fmt='o')
show()
def prior(cube):
    # prior transform: map the unit cube onto the parameter ranges
    cube[0] = cube[0]*2 - 1   # center in [-1, 1]
    cube[1] = cube[1]*2       # width  in [0, 2]
    cube[2] = cube[2]*2       # height in [0, 2]
    return cube
def loglike(cube):
    # Gaussian (chi-squared) log-likelihood against the simulated data
    modelNow = model(cube)(xdata)
    return -0.5 * ((modelNow - ydata)**2. / yuncs**2.).sum()
from concurrent.futures import ThreadPoolExecutor,\
ProcessPoolExecutor
if __name__ == '__main__':
    if not os.path.exists("chains"): os.mkdir("chains")
    ndims = 3
    with ThreadPoolExecutor(max_workers=cpu_count()-1) as executor:
        sampler = dynesty.DynamicNestedSampler(
            loglike,
            prior,
            ndim=ndims,
            nparam=ndims,
            bound='multi',
            sample='unif',
            pool=executor,
            queue_size=cpu_count())
        sampler.run_nested(nlive_init=100, nlive_batch=100)
        res = sampler.results
    from dynesty import plotting as dyplot
    # evidence check
    fig, axes = dyplot.runplot(res, color='red', logplot=True)
    fig.tight_layout()
    joblib.dump(res, 'dynesty_double_gaussian_test_results.joblib.save')
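For completeness, the saved results could be inspected later with something like this (a minimal sketch; it assumes the joblib file written by the script above and uses dynesty's Results.summary() and plotting.cornerplot helpers):

import joblib
from dynesty import plotting as dyplot

# reload the results saved by the script above
res = joblib.load('dynesty_double_gaussian_test_results.joblib.save')
res.summary()                       # print iterations, samples, log-evidence, etc.
fig, axes = dyplot.cornerplot(res)  # posterior corner plot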