
I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's a multivariate normal (i.e. Gaussian) distribution, but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal, but I need samples in my box.

Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
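For concreteness, my rejection loop looks roughly like this (a minimal sketch; the function name and the batched resampling are incidental):

import numpy as np

def rejection_sample(mean, cov, lower, upper, n):
    # Draw from the full Gaussian and keep only the samples inside the box.
    samples = np.empty((0, len(mean)))
    while len(samples) < n:
        draws = np.random.multivariate_normal(mean, cov, size=n)
        inside = np.all((draws >= lower) & (draws <= upper), axis=1)
        samples = np.vstack([samples, draws[inside]])
    return samples[:n]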

So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion and to the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there a native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
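(For reference, the one-variable truncnorm interface looks like this, with mu, sd, lo and hi placeholder scalars; note that the bounds are given in standard-deviation units:)

from scipy.stats import truncnorm

# 1-D truncated normal on [lo, hi]; truncnorm standardizes bounds as (bound - loc) / scale
samples = truncnorm.rvs((lo - mu) / sd, (hi - mu) / sd, loc=mu, scale=sd, size=1000)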

I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to its principal axes (with an eigendecomposition or SVD), use a product of one-dimensional truncated Gaussians to sample in that frame, rotate the samples back, and reject/resample as necessary, roughly as sketched below. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
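A rough sketch of that idea, assuming scipy is available (the helper name and the conservative per-axis envelope built from the rotated box corners are my own choices, not a tested implementation):

import numpy as np
from scipy.stats import truncnorm

def sample_rotated_truncated(mean, cov, lower, upper, n):
    # Principal axes of the untruncated Gaussian.
    evals, evecs = np.linalg.eigh(cov)
    sd = np.sqrt(evals)
    # Conservative axis-aligned envelope of the box in the rotated frame,
    # built from its 2**d rotated corners (fine for modest dimensions).
    corners = np.array(np.meshgrid(*zip(lower, upper))).T.reshape(-1, len(mean))
    rot = (corners - mean) @ evecs
    lo, hi = rot.min(axis=0), rot.max(axis=0)
    samples = np.empty((0, len(mean)))
    while len(samples) < n:
        # Product of 1-D truncated normals along the principal axes...
        z = truncnorm.rvs(lo / sd, hi / sd, scale=sd, size=(n, len(mean)))
        # ...rotated back to the original frame, then rejection against the box.
        x = z @ evecs.T + mean
        ok = np.all((x >= lower) & (x <= upper), axis=1)
        samples = np.vstack([samples, x[ok]])
    return samples[:n]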

Warrick

5 Answers


So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult than the univariate case. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.

I used emcee to do the MCMC work. I find this package phenomenally easy to use. It only requires a function that returns the log-probability of the desired distribution, so I defined this function:

import numpy as np
from numpy.linalg import inv

def lnprob_trunc_norm(x, mean, bounds, C):
    # Log-probability (up to an additive constant) of the truncated MVN:
    # -inf outside the box, the usual Gaussian log-density inside it.
    if np.any(x < bounds[:, 0]) or np.any(x > bounds[:, 1]):
        return -np.inf
    else:
        return -0.5 * (x - mean).dot(inv(C)).dot(x - mean)

Here, C is the covariance matrix of the multivariate normal. Then, you can run something like

S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args=(mean, bounds, C))

pos, prob, state = S.run_mcmc(pos, Nsteps)

for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,

pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)

or sampled from an untruncated multivariate normal,

pos = np.random.multivariate_normal(mean, C, size=Nwalkers)

and so on. I personally run several thousand burn-in steps first, discarding those samples (it's fast), then force any walkers still outside the bounds back within them, and only then run the MCMC sampling that I keep, as sketched below.
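A minimal sketch of that pattern, reusing the names above (the burn-in length is a placeholder):

pos, prob, state = S.run_mcmc(pos, 2000)        # cheap burn-in; these samples are discarded
pos = np.clip(pos, bounds[:, 0], bounds[:, 1])  # force straggling walkers back into the box
S.reset()
pos, prob, state = S.run_mcmc(pos, Nsteps)      # the sampling that is actually kept
samples = S.flatchain                           # shape (Nwalkers * Nsteps, Ndim)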

The number of steps for convergence is up to you.

Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
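For example (threads= is the older emcee 2.x interface; emcee 3.x takes a pool argument instead, as in the benchmark answer further down):

S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm,
                          args=(mean, bounds, C), threads=Nthreads)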

Warrick

I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I also used to use emcee as described in the answer by Warrick, but the number of samples needed for convergence exploded in higher dimensions, making it impractical for my use case.

The algorithm was introduced by Botev (2016) and uses an accept-reject scheme based on minimax exponential tilting. It was originally implemented in MATLAB, but reimplementing it in Python significantly increased performance compared to running it through the MATLAB engine in Python. It also works well and is fast in higher dimensions.

The code is available at: https://github.com/brunzema/truncated-mvn-sampler.

An Example:

import numpy as np
from minimax_tilting_sampler import TruncatedMVN  # from the repository above

d = 10  # dimensions

# random mean and covariance
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())  # symmetrize
cov = np.dot(cov, cov)                  # make positive semi-definite

# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf

# create truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)

Plotting the first dimension results in:

[Figure: histogram of the first dimension of the truncated MVN samples]


Reference:

Botev, Z. I. (2016). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B, 79(1), 125-148.

colts44

Simulating a truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.

My short answer is, you can use my code (https://github.com/ralphma1203/trun_mvnt)! It implements the Gibbs sampler algorithm from Li and Ghosh (2015), which can handle general linear constraints of the form lower ≤ Dx ≤ upper, even when you have a non-full-rank D and more constraints than the dimensionality.

import numpy as np
from trun_mvnt import rtmvn, rtmvt

########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix

D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])

n = 10 # want 10 final samples
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs


random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) 
# Numpy array n-by-p as result!
random_sample

########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal

D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])

n = 500 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs

random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
Ralph

I wrote a script to measure the run time of the solutions provided so far. To draw 100 samples from a 50-dimensional distribution, the run times are:

  1. Botev 2016 implementation 0.029579 s
  2. Li and Ghosh 2015 implementation 2.597150 s
  3. using emcee 217.914969 s

For comparison, sampling without any truncation using a Cholesky decomposition takes 0.000201 s.

For this specific scenario, the Botev 2016 implementation is the fastest method to sample from a truncated multivariate normal distribution. The emcee approach is much slower, but maybe it can be tuned for better performance.

The output was produced with the following code on an x86 machine with 6 cores.


import emcee
from minimax_tilting_sampler import TruncatedMVN
from trun_mvnt import rtmvn
import numpy as np
from scipy import linalg
from multiprocessing import Pool

import time

def log_prob_mvtnd(x, mean, lb, ub, vcm):
    if np.any(np.less(x,lb)) or np.any(np.greater(x,ub)):
        return -np.inf
    else:
        diff = x - mean
        return -0.5 * np.dot(diff, np.linalg.solve(vcm, diff))

def tmvnd_emcee(vcm,lb,ub,n,mean=None):
    
    ndim = vcm.shape[0]
    
    if mean is None:
        mean = np.zeros(ndim)
    
    with Pool() as pool:
        sampler = emcee.EnsembleSampler(nwalkers=n,
                                        ndim=ndim,
                                        log_prob_fn=log_prob_mvtnd,
                                        args = (mean, lb,ub, vcm),
                                        pool=pool
                                        )
    
        p0 = np.random.multivariate_normal(mean, vcm, size=n)
        # p0 = emcee.utils.sample_ball(mean, np.sqrt(np.diag(vcm)), size=n)
        
        state = sampler.run_mcmc(p0, 10)
        sampler.reset()
        sampler.run_mcmc(state, 100)

    return sampler.get_last_sample().coords

def mvnd_chol(vcm,n):
    gauss_samples = np.random.randn(vcm.shape[0],n)
    R = linalg.cholesky(vcm, lower=True)
    return np.dot(R,gauss_samples)
    
def tmvnd_botev16(vcm,lb,ub,n):
    tmvn = TruncatedMVN(np.zeros(vcm.shape[0]), vcm, lb, ub)
    return tmvn.sample(n)

def tmvnd_Li_Ghosh_15(vcm,lb,ub,n):
    burn = 100 # burn-in first 100 iterates
    thin = 1 # thinning for Gibbs

    D = np.eye(vcm.shape[0])
    return rtmvn(n, np.zeros(vcm.shape[0]), vcm, D, lb, ub, burn, thin).T 

if __name__ == '__main__':
    
    ndim=50
    nsamples = 100
    sigma = np.exp(-1*np.linspace(0, 10, ndim))
    vcm = linalg.toeplitz(sigma)
    
    trunc_sigma = 2
    ub = np.sqrt(np.diag(vcm))*trunc_sigma
    lb = -ub
    
    
    samplers = { 'not truncated' : lambda vcm,lb,ub,n: mvnd_chol(vcm,n),
                 'emcee' : lambda vcm,lb,ub,n: tmvnd_emcee(vcm,lb,ub,n),
                 'Botev 2016': lambda vcm,lb,ub,n: tmvnd_botev16(vcm,lb,ub,n),
                 'Li and Ghosh 2015': lambda vcm,lb,ub,n: tmvnd_Li_Ghosh_15(vcm,lb,ub,n)}
    
    samples = []
    for name, sampler in samplers.items():
        start = time.time()
        samples.append(sampler(vcm,lb,ub,nsamples))
        end = time.time()
        print("%s %f s" %(name, end - start))
acac

A little late I guess, but for the record: you could use Hamiltonian Monte Carlo. A MATLAB module named HMC exact exists; it shouldn't be too difficult to translate to Python.

adriki