I am making a multilevel logistic regression model in stan (using rstan) to fit to a large dataset. My dataset is a 2000 by 100100 matrix, containing simulated sequencing data of 100100 positions in the genome (SNP) of 2000 individuals. The goal of the model is to estimate the effect of each SNP on an outcome variable which is 0 or 1 (healthy or sick).
I want to estimate 100100 slopes, one for each SNP. I have written the following model in STAN, which I was able to fit to a cropped dataset of 2000 by 101, with Rhat = 1 and very high n_eff values.
data{
int<lower=1> N; // Rows
int<lower=1> M; // Columns
int<lower=0, upper=1> y[N]; // Outcome variable
matrix[N, M] x; // Predictor variable
}
parameters{
// Hyper priors
real mu;
real<lower=0> sigma;
// Priors
real a;
vector[M] b;
}
model{
// Hyper-priors
mu ~ normal(0, 10);
sigma ~ cauchy(0, 10);
// Priors
a ~ normal( 0 , 10 );
b ~ normal(mu, sigma);
// Likelihood
y ~ bernoulli_logit(a + x * b);
}
When I try to fit the model to the large dataset using the rstan function sampling(), there is no response from R. Compilation of the model works fine. Do you know what I am doing wrong?
The R-code looks like this:
model <- stan_model("mult_predictor.stan")
fit <- sampling(model, list(N=2000,
M=100100,
y=y,
x=x),
iter=2000, chains=4)