
I am making a multilevel logistic regression model in Stan (using rstan) to fit to a large dataset. My dataset is a 2000 by 100100 matrix containing simulated sequencing data for 100100 positions in the genome (SNPs) of 2000 individuals. The goal of the model is to estimate the effect of each SNP on a binary outcome variable (0 = healthy, 1 = sick).

I want to estimate 100100 slopes, one for each SNP. I have written the following model in Stan, which I was able to fit to a cropped dataset of 2000 by 101, with Rhat = 1 and very high n_eff values.

data{
  int<lower=1> N; // Rows
  int<lower=1> M; // Columns

  int<lower=0, upper=1> y[N]; // Outcome variable
  matrix[N, M] x; // Predictor variable
}

parameters{
  // Hyperparameters for the distribution of slopes
  real mu;
  real<lower=0> sigma;

  // Intercept and per-SNP slopes
  real a;
  vector[M] b;
}

model{
  // Hyper-priors
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 10);

  // Priors
  a ~ normal(0, 10);
  b ~ normal(mu, sigma);

  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}

When I try to fit the model to the large dataset using the rstan function sampling(), R hangs and produces no output. Compilation of the model works fine. Do you know what I am doing wrong?

The R code looks like this:

library(rstan)

model <- stan_model("mult_predictor.stan")
fit <- sampling(model, list(N=2000,
                            M=100100,
                            y=y,
                            x=x),
                iter=2000, chains=4)

  • It is possible that your code is working just fine and that the dataset is too big. Try running the code on increasing data sizes and investigate runtime and memory consumption. By extrapolation you could then estimate what would happen on the full dataset. – BigFinger Jan 02 '20 at 17:33
  • It takes an extremely long time to fit ca. 150 parameters or more. From the literature I understand that multilevel models with 10,000 parameters are not uncommon, so it should be possible to fit something in the ballpark of 10,000 parameters (maybe more). In my model there will be a parameter for each column - 100100 parameters if the full dataset is used. – Kasper Fischer-Rasmussen Jan 03 '20 at 11:08
  • Yes, it's possible to fit 100K parameters with Stan, but the model has to make sense for the data or things get very slow. In your case, with way more parameters than data points, you probably need strong priors. And you probably want a non-centered parameterization of `b` (see the user's guide for details, and the sketch after these comments). I would second BigFinger's advice to simulate data of increasing size to see where things break. Output size should be 100K * 4K, or 400M numbers at 8 bytes each, or about 3.2 GB of memory. R starts by preallocating that. – Bob Carpenter Jan 03 '20 at 16:01
  • Just another note to add about the priors: as a base model, consider starting with a Bayesian ridge regression, i.e., a zero-centered Laplace prior on the coefficients: `b ~ double_exponential(0, sigma)` (see the second sketch after these comments). Even if you ultimately don't use it, there is [precedent for its use in GWAS](https://www.sciencedirect.com/science/article/pii/S0002929708000918), so it could be a good starting place. – merv Jan 11 '20 at 21:38
  • What is the utility of ridge regression as compared to a multilevel model? From what I understand, ridge regression basically just uses a very narrow prior (near 0). Why is this different from the multilevel model above, where the prior is learned from the data, including its width and thereby its degree of constraint? In the case of ridge regression sigma is still learned from the data and can potentially be large, allowing a wide spread of values around zero. – Kasper Fischer-Rasmussen Jan 14 '20 at 17:05
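
To make the non-centered parameterization mentioned by Bob Carpenter concrete, here is a minimal sketch of how the model above could be rewritten; the name `b_raw` is introduced here only for illustration and is not part of the original model, and the data block is unchanged.

parameters{
  // Hyperparameters for the distribution of slopes
  real mu;
  real<lower=0> sigma;

  // Intercept and standardized slopes
  real a;
  vector[M] b_raw;
}

transformed parameters{
  // Implies b ~ normal(mu, sigma), but the sampler works on b_raw
  vector[M] b = mu + sigma * b_raw;
}

model{
  // Hyper-priors
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 10);

  // Priors
  a ~ normal(0, 10);
  b_raw ~ normal(0, 1);

  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}

Note that `b` declared in transformed parameters is still written to the output for every draw, so this does not reduce the roughly 3.2 GB of posterior output; it only changes the geometry the sampler has to explore.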

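Similarly, here is a minimal sketch of the zero-centered Laplace (double exponential) prior that merv suggests, again only as a possible starting point: the hyperparameter mu drops out because the prior is centered at zero, while sigma is still learned from the data and controls the overall amount of shrinkage. Only the parameters and model blocks change relative to the original code.

parameters{
  // Shrinkage scale for the slopes
  real<lower=0> sigma;

  // Intercept and per-SNP slopes
  real a;
  vector[M] b;
}

model{
  // Hyper-prior
  sigma ~ cauchy(0, 10);

  // Priors
  a ~ normal(0, 10);
  b ~ double_exponential(0, sigma);

  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}
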
0 Answers