
I'm new to Stan, so hoping you can point me in the right direction. I'll build up to my situation to make sure we're on the same page...

If I had a collection of univariate normals, the docs tell me that:

y ~ normal(mu_vec, sigma);

provides the same model as the unvectorized version:

for (n in 1:N)
    y[n] ~ normal(mu_vec[n], sigma);

but that the vectorized version is (much?) faster. Ok, fine, makes good sense.

So the first question is: can I take advantage of this vectorization speedup in the univariate normal case when both the mu and the sigma of the samples vary by position in the vector? I.e., if both mu_vec and sigma_vec are vectors (in the previous case, sigma was a scalar), then is this:

y ~ normal(mu_vec, sigma_vec);

equivalent to this:

for (n in 1:N)
    y[n] ~ normal(mu_vec[n], sigma_vec[n]);

and if so is there a comparable speedup?

Ok. That's the warmup. The real question is how to best approach the multi-variate equivalent of the above.

In my particular case, I have N observations of bivariate data for some variable y, which I store in an N x 2 matrix. (For order of magnitude, N is about 1000 in my use case.)

My belief is that the mean of each component of each observation is 0 and that the stdev of each component of each observation is 1 (and I'm happy to hard-code them as such). However, my belief is that the correlation (rho) varies from observation to observation as a (simple) function of another observed variable, x (stored in an N-element vector). For example, we might say that rho[n] = 2*inverse_logit(beta * x[n]) - 1 for n in 1:N, and our goal is to learn about beta from our data. I.e. the covariance matrix for the nth observation would be:

[1,      rho[n]]
[rho[n], 1     ]

My question is: what's the best way to put this together in a Stan model so that it isn't slow as heck? Is there a vectorized version of the multi_normal distribution, so that I could specify this as:

y ~ multi_normal(vector_of_mu_2_tuples, vector_of_sigma_matrices);

or perhaps as some other similar formulation? Or will I need to write:

for (n in 1:N)
    y[n] ~ multi_normal(vector_of_mu_2_tuples[n], vector_of_sigma_matrices[n]);

after having set up vector_of_sigma_matrices and vector_of_mu_2_tuples in an earlier block?

Thanks in advance for any guidance!


Edit to add code

Using Python, I can generate data in the spirit of my problem as follows:

import numpy as np
import pandas as pd
import pystan as pys
import scipy as sp
import scipy.special  # make sp.special available
import scipy.stats    # make sp.stats available
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns

def gen_normal_data(N, true_beta, true_mu, true_stdevs):
    # simulate N bivariate normal draws whose correlation is a
    # logistic function of a randomly drawn per-observation driver

    drivers = np.random.randn(N)
    correls = 2.0 * sp.special.expit(drivers*true_beta)-1.0

    observations = []
    for i in range(N):
        # covariance matrix implied by the fixed stdevs and correls[i]
        covar = np.array([[true_stdevs[0]**2, true_stdevs[0] * true_stdevs[1] * correls[i]],
                          [true_stdevs[0] * true_stdevs[1] * correls[i], true_stdevs[1]**2]])
        observations.append(sp.stats.multivariate_normal.rvs(true_mu, covar, size=1).tolist())
    observations = np.array(observations)
    
    return {
        'N': N,
        'true_mu': true_mu,
        'true_stdev': true_stdevs,
        'y': observations,
        'd': drivers,
        'correls': correls
    }

and then actually generate the data using:

normal_data = gen_normal_data(100, 1.5, np.array([1., 5.]), np.array([2., 5.]))

Here's what the data set looks like (scatterplot of y colored by correls in the left pane and by drivers in the right pane). The idea is that the higher the driver, the closer the correl is to 1, and the lower the driver, the closer it is to -1. So we'd expect red dots in the left pane to run "down-left to up-right" and blue dots to run "up-left to down-right", and indeed they do:

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
x = normal_data['y'][:, 0]
y = normal_data['y'][:, 1]
correls = normal_data['correls']
drivers = normal_data['d']

for ax, colordata, cmap in zip(axes, [correls, drivers], ['coolwarm', 'viridis']):
    color_extreme = max(abs(colordata.max()), abs(colordata.min()))
    sc = ax.scatter(x, y, c=colordata, lw=0, cmap=cmap, vmin=-color_extreme, vmax=color_extreme)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.05)
    fig.colorbar(sc, cax=cax, orientation='vertical')
fig.tight_layout()

[figure: scatter plots of y colored by correls (left) and by drivers (right)]

Using the brute-force approach, I can set up a Stan model that looks like this:

model_naked = pys.StanModel(
    model_name='naked',
    model_code="""
data {
    int<lower=0> N;
    vector[2] true_mu;
    vector[2] true_stdev;
    real d[N];
    vector[2] y[N];
}

parameters {
    real beta;
}

model {
    real rho[N];
    matrix[2, 2] cov[N];
    
    for (n in 1:N) {
        rho[n] = 2.0*inv_logit(beta * d[n]) - 1.0;
 
        cov[n, 1, 1] = true_stdev[1]^2;
        cov[n, 1, 2] = true_stdev[1] * true_stdev[2] * rho[n];
        cov[n, 2, 1] = true_stdev[1] * true_stdev[2] * rho[n];
        cov[n, 2, 2] = true_stdev[2]^2;
    }
    
    beta ~ normal(0, 10000);
    for (n in 1:N) {
        y[n] ~ multi_normal(true_mu, cov[n]);
    }
}
"""
)

This fits nicely:

fit_naked = model_naked.sampling(data=normal_data, iter=1000, chains=2)
f = fit_naked.plot()
f.tight_layout()

[figure: trace plot for beta]

But I'm hoping someone can point me in the right direction for the "marginalized" approach where we break our bivariate normal down into a pair of independent normals that can be blended using the correlation. The reason I need this is that in my actual use case, both dimensions of y are fat-tailed. I am happy to model this as a student-t distribution, but the issue is that Stan only allows a single nu to be specified (not one for each dimension), so I think I'll need to find a way to decompose a multi_student_t into a pair of independent student_t's so that I can set the degrees of freedom separately for each dimension.

8one6

2 Answers


The univariate normal distribution does accept vectors for any or all of its arguments and it will be faster than looping over the N observations to call it N times with scalar arguments.

However, the speedup is only going to be linear, because the calculations are all the same; the vectorized call just has to allocate memory once instead of N times. The overall wall time is more affected by the number of function evaluations you have to do, which is up to 2^10 - 1 per MCMC iteration (by default), but whether you hit the maximum treedepth depends on the geometry of the posterior distribution you are trying to sample from, which, in turn, depends on everything, including the data you condition on.

The bivariate normal distribution can be written as a product of a marginal univariate normal distribution for the first variable and a conditional univariate normal distribution for the second variable given the first variable. In Stan code, we can utilize element-wise multiplication and division to write its log-density like

target += normal_lpdf(first_variable | first_means, first_sigmas);
target += normal_lpdf(second_variable | second_means + 
  rhos .* second_sigmas ./ first_sigmas .* (first_variable - first_means),
  second_sigmas .* sqrt(1 - square(rhos)));

Unfortunately, the more general multivariate normal distribution in Stan does not have an implementation that inputs arrays of covariance matrices.
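
To make that concrete with the variable names from the question, here is a minimal, untested sketch of a full model built on this factorization (the y1/y2 split in transformed data, declaring d as a vector, and the vectorized rho line are glue added here, not part of the snippet above):

data {
    int<lower=0> N;
    vector[2] true_mu;
    vector[2] true_stdev;
    vector[N] d;
    vector[2] y[N];
}

transformed data {
    // split the bivariate observations into two univariate vectors
    vector[N] y1;
    vector[N] y2;
    for (n in 1:N) {
        y1[n] = y[n, 1];
        y2[n] = y[n, 2];
    }
}

parameters {
    real beta;
}

model {
    vector[N] rho = 2.0 * inv_logit(beta * d) - 1.0;

    beta ~ normal(0, 10000);

    // marginal distribution of the first component
    target += normal_lpdf(y1 | true_mu[1], true_stdev[1]);
    // conditional distribution of the second component given the first
    target += normal_lpdf(y2 | true_mu[2]
        + rho * (true_stdev[2] / true_stdev[1]) .* (y1 - true_mu[1]),
        true_stdev[2] * sqrt(1 - square(rho)));
}

No Jacobian adjustment is needed because y1 and y2 are data, not transformed parameters.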

Ben Goodrich
  • Thank you for your quick response! First off, I want to be clear that I understand your final point: "No, there isn't a nice tidy vectorized version of `multi_normal` as there is for `normal`." Trying to run with the rest of your answer, and with apologies for not being more of a pro at Stan here...could you give a bit more context for how I could use your code snippet in my context to get my example up and running? Or, alternatively, with 1000 observations of my bivariate data, how catastrophic do you think it would be if I tried to fit my `beta` using the loop over multi-normals? – 8one6 May 04 '18 at 13:26
  • Also, can you confirm: it looks like the `multi_normal` implementation in Stan *does* accept an array of vectors for `mu`, so you *could* have a sample-specific mean (but it does not currently support the same for the covariance matrix `sigma`). If that's right, do you think this would be a reasonable feature request for Stan? (I.e. allowing arrays of both `sigma` and `mu` instead of just `mu`...) – 8one6 May 04 '18 at 17:11
  • I would need to see your whole code to get all the variable names right. It would be a reasonable feature request to allow `multi_normal_lpdf` and friends to input an array of covariance matrices, but even if that existed, you still would not want to utilize it in the bivariate normal case. – Ben Goodrich May 04 '18 at 23:27
  • Ben, please see my edits above. This is (should be?) fully executable Python/PyStan code, so I'm hoping it's easy for you to replicate/see what I'm getting at. I have included full code to generate toy data, as well as the Stan model that recovers my parameters correctly (great!). Hoping you could show me how to code up this example using the approach you mentioned, since I'd like to go from normal to student-t distributions, but such a transition would be much easier with your approach (I hope!) than mine. (I'd be happy to put a bounty on this if that would help!) – 8one6 May 06 '18 at 12:42

This isn't quite answering your question, but you can make your program more efficient by removing a bunch of redundant calculations and by converting the scaling a bit to use tanh rather than a scaled inverse logit. I'd get rid of the scaling and just use smaller betas, but I left it in so that it should get the same results.

data {
  int<lower=0> N;
  vector[2] mu;
  vector[2] sigma;
  vector[N] d;
  vector[2] y[N];
}
parameters {
  real beta;
}
transformed data {
  real var1 = square(sigma[1]);
  real var2 = square(sigma[2]);
  real covar12 = sigma[1] * sigma[2];
  vector[N] d_div_2 = d * 0.5;
}
model {
  // note: 2 * inv_logit(u) - 1 = tanh(u / 2)
  vector[N] rho = tanh(beta * d_div_2);
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[2, 2] = var2;
  // only reassign what's necessary with minimal recomputation
  for (n in 1:N) {
    Sigma[1, 2] = rho[n] * covar12;
    Sigma[2, 1] = Sigma[1, 2];
    y[n] ~ multi_normal(mu, Sigma);
  }
  // weakly informative priors fit more easily
  beta ~ normal(0, 8);
}

You could also work out the Cholesky factorization analytically as a function of rho and the other fixed values and use that---it saves a solver step in the multivariate normal.
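
Since Sigma here is 2 x 2, its Cholesky factor has a closed form: for Sigma = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]], the factor L = [[s1, 0], [rho*s2, s2*sqrt(1-rho^2)]] satisfies L*L' = Sigma. An untested sketch of the model block rewritten this way (everything else in the program above stays the same):

model {
  vector[N] rho = tanh(beta * d_div_2);
  matrix[2, 2] L;
  L[1, 1] = sigma[1];
  L[1, 2] = 0;
  for (n in 1:N) {
    // analytic Cholesky factor of the covariance for observation n
    L[2, 1] = rho[n] * sigma[2];
    L[2, 2] = sigma[2] * sqrt(1 - square(rho[n]));
    y[n] ~ multi_normal_cholesky(mu, L);
  }
  beta ~ normal(0, 8);
}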

The other option you have is to write out the multi-student-t directly rather than using our built-in implementation. The built-in probably won't be a whole lot faster as the whole operation's pretty heavily dominated by the matrix solve.
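
For the bivariate case, "writing it out" could look roughly like the user-defined density below, which follows the standard multivariate student-t formula with d = 2 (an illustrative sketch, and note it still takes a single nu, just like the built-in):

functions {
  real bivariate_student_t_lpdf(vector y, real nu, vector mu, matrix Sigma) {
    vector[2] z = y - mu;
    // quadratic form (y - mu)' * inverse(Sigma) * (y - mu)
    real quad = quad_form(inverse_spd(Sigma), z);
    return lgamma((nu + 2) / 2) - lgamma(nu / 2) - log(nu * pi())
           - 0.5 * log_determinant(Sigma)
           - (nu + 2) / 2 * log1p(quad / nu);
  }
}

A vector y[n] could then be modeled with y[n] ~ bivariate_student_t(nu, mu, Sigma); in the model block.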

Bob Carpenter
  • Thank you so much for your helpful comments! Your thoughts about reducing redundancy are very much appreciated and quite clear. I was hoping you might be able to go a bit further and explain the comments about switching to a Cholesky factorization. Are you suggesting a) that we use a Stan function to do the decomposition after the `Sigma` matrix is defined (and then feed that into `multi_normal_cholesky`), or are you suggesting b) that we do the Cholesky factorization by hand (not super hard analytically for a 2-d case) and then feed that in? Would you be able to provide an example? – 8one6 May 07 '18 at 00:44
  • Also, and I know this is asking a lot, it would be incredibly helpful if you were able to give a working example of the approach which decomposes the bivariate normal into a combination of independent normals weighted appropriately to get the correlation right. I believe my efforts have come up short because I'm not manually adjusting for a non-unit Jacobian properly. This is the whole "if `A` and `B` are standard bivariate normal with correlation `rho`, then you can write `B = rho * A + sqrt(1-rho^2) * Z` where `Z` is another standard normal independent of `A`" idea. – 8one6 May 07 '18 at 01:06
  • No, I'm suggesting that since it's only 2 x 2, you could work out the Cholesky decomposition analytically as a function of rho and then use `multi_normal_cholesky`. You can work through the multivariate case and break out the 2D version. What you have above implies that `B ~ normal(rho * A, sqrt(1 - rho^2))`. So presumably you could couple that with `A ~ normal(muA, sigmaA)` where `muA` and `sigmaA` are the marginal mean and standard deviation of `A`. – Bob Carpenter May 08 '18 at 03:20