0

I am trying to build a binomial regression model with rstan.

The aim is to get the effect size bt between two conditions X in a group t in total and the effect size btg for subgroups g.

library(rstan)

df <- data.frame(hits=c(36,1261,36,1261,49,1248,17,7670,25,759,29,755),trials=c(118,53850,184,53784,209,53759,118,53850,184,53784,209,53759)
                ,X=rep(c(1,0),6),g=rep(rep(1:3, each=2),2),t=rep(1:2,each=6),tg=rep(1:6,each=2) )

stanIn <- list(Nt=length(unique(df$t)), #number of groups t
            Nc=length(df$t), #number of rows
            Ng=length(unique(df$g)), #number of subgroups g
            Ntg=length(unique(df$tg)), #number of t and g combinations
            N=df$trials, 
            n=df$hits,
            X=df$X, #condition 1 or 0
            t=df$t, #index of groups
            g=df$g, #index of subgroups
            tg=df$tg) #index of combinations between t and g

model <- stan(data=stanIn, file="minimal.stan", chains = 4)

with minimal.stan as below.

data {
  int<lower=1> Nt; 
  int<lower=1> Nc; 
  int<lower=1> Ng; 
  int<lower=1> Ntg; 
  int<lower=1> N[Nc];
  int<lower=0> n[Nc]; 
  int<lower=0,upper=1> X[Nc]; 
  int<lower=1> t[Nc]; 
  int<lower=0> g[Nc]; 
  int<lower=1> tg[Nc]; 
}

parameters {
  real at[Nt]; // group intercepts 
  real bt[Nt]; // group slopes  
  real btg[Ntg]; // subgroup slopes 
  real atg[Ntg]; // subgroup intercept
}

transformed parameters {
  vector[Nc] theta; // binomial probabilities

  for (i in 1:Nc) { // linear model
    theta[i] = inv_logit( (atg[tg[i]]+at[t[i]] ) + (bt[t[i]]+btg[tg[i]]) * X[i]);
    //theta[i] = inv_logit(at[t[i]] + bt[t[i]] * X[i]); //group effect
    //theta[i] = inv_logit(atg[tg[i]] + btg[tg[i]] * X[i]); //subgroup effects
  }
}

model {
  at ~ normal(0.0, 20.0);
  bt ~ normal(0.0, 20.0);  
  atg ~ normal(0.0, 20.0);
  btg ~ normal(0.0, 20.0);
  n ~ binomial(N, theta);
}

I can model the overall group effect with the first commented line (in transformed parameters) and the subgroup effects with the second commented line. The idea was to combine both to get a group effect and the deviation from it for the individual group (first line).

However, this gives really weird results for bt and btg (A), and I was expecting something more like (B) (I can not recreate the behavior seen in A with a minimal example, this only occurs in the full dataset.)

enter image description here

If its not apparent from the type of question, I am completely new to statistic modeling and suspect that I have a conceptional error. So I would be grateful for any hint on this issue or a source to read up on those things (Feels like a common thing but I did not find anything).

kEks
  • 304
  • 1
  • 9
  • Welcome to Stackoverflow. It is not apparent how you want people to help you, start with your specific question. "this gives really weird results for bt and btg." is not a question. Along with your question, what is it that you think is "weird"? And it will be easier for people to help if you post your whole model so others can try running it and debugging it. – ssp3nc3r May 27 '20 at 16:13
  • thx, your right. I include a minimal example and why I think the results are "weird". (Although I can not create the exact behavior in a minimal example) – kEks May 28 '20 at 10:16
  • Also, include a question upfront. It seems your unstated question may be something like: "I'd like help combining group effects with subgroup effects in a binomial model. If I model group effects separately from subgroup effects, I get expected results." – ssp3nc3r May 28 '20 at 14:36
  • Also, I think your commented out code is incorrect: based on your parameter declarations, a and b should be at and bt. – ssp3nc3r May 28 '20 at 15:03

1 Answers1

0

I see several issues. Most directly, the model is not identified as the parameters atg and btg includes at and bt. I've changed those to ag and bg as they are your subgroup parameters, and indexed them as such, below:

library(rstan)

df <- data.frame(hits   = c(36, 1261, 36, 1261, 49, 1248, 
                            17, 7670, 25, 759, 29, 755),
                 trials = c(118, 53850, 184, 53784, 209, 53759, 
                            118, 53850, 184, 53784, 209, 53759),
                 X  = rep(c(1,0), 6),
                 g  = rep(rep(1:3, each=2), 2),
                 t  = rep(1:2, each=6),
                 tg = rep(1:6, each=2) )

stanIn <- list(Nt = length(unique(df$t)), #number of groups t
               Nc = length(df$t),         #number of rows
               Ng = length(unique(df$g)), #number of subgroups g
               N  = df$trials, 
               n  = df$hits,
               X  = df$X,                 #condition 1 or 0
               t  = df$t,                 #index of groups
               g  = df$g)                 #index of subgroups

model <- stan(data = stanIn, file = "minimal.stan", 
              cores = 4, chains = 4,
              control = list(max_treedepth = 14))

with this modified Stan model samples without issues:

data {
  int<lower=1> Nt; 
  int<lower=1> Nc; 
  int<lower=1> Ng; 
  int<lower=1> N[Nc];
  int<lower=0> n[Nc]; 
  vector<lower=0,upper=1>[Nc] X; 
  int<lower=1> t[Nc]; 
  int<lower=0> g[Nc]; 
}

parameters {
  vector<offset=0, multiplier=20>[Nt] at; // group intercepts 
  vector<offset=0, multiplier=20>[Nt] bt; // group slopes  
  vector<offset=0, multiplier=20>[Ng] ag; // subgroup intercepts
  vector<offset=0, multiplier=20>[Ng] bg; // subgroup slopes
}

model {
  at ~ normal(0.0, 20.0);
  bt ~ normal(0.0, 20.0);  
  ag ~ normal(0.0, 20.0);
  bg ~ normal(0.0, 20.0);
  n  ~ binomial_logit(N, ag[g] + at[t] + (bt[t] + bg[g]) .* X);
}

generated quantities {
  vector[Nc] theta = inv_logit(ag[g] + at[t] + (bt[t] + bg[g]) .* X);
}

Of note, I had to use a high max_treedepth to fit the model, but it's hard for me to comment on that without understanding the data. I've also moved theta to generated quantities in case you want those calculations, but the binomial_logit handles this directly.

I've also setup noncentered parameters using offset and multiplier so that the Stan sampler has a better parameter space to sample from. Finally, I've re-coded the loops as vectors.

ssp3nc3r
  • 3,662
  • 2
  • 13
  • 23
  • In your solution I get three *ag* and *bg* parameters, one for each subgroup. Which would correspond to the influence of a given subgroup over **all** groups. However, my aim was to get a *atg* and *btg* for **each** subgroup (making it 6 in this example). But if I understand you correctly this is non nonsensical and leads to an unidentified model, correct? – kEks May 29 '20 at 10:11
  • Both groups and subgroups are included. By having both the g and t parameters, the combination of parameters captures all 6 combinations. Feel free to modify as you like. – ssp3nc3r May 30 '20 at 16:05