
I have some ordinal data with missingness, which I am trying to model in Stan. Since Stan cannot handle discrete parameters directly, I am attempting to marginalize over the different possible values of the response variable for those cases which are missing.

Intuitively, I believe that I need the probability of a missing value being a certain ordinal outcome, multiplied by the probability of that outcome, plus the probability of the missing value not being that outcome, multiplied by 1 minus the probability of that outcome.

However, in practice I am uncertain whether I am coding the model correctly in Stan itself, and whether my mathematical intuition for marginalising over the missing data is correct. Although the Stan manual has examples of dealing with discrete parameters, I'm still a bit lost. I would also like to make predictions, or obtain an imputed vector of y values, in the generated quantities block, but I could do with some pointers on getting started. The problem is not getting code running, but rather sanity-checking that I have done it correctly.

Example

As an example, consider the wine data from the R package ordinal. The R code below takes the rating variable, transforms it to a 3-category scale for ease of explication, induces 15% missingness in the variable new_rating_w_miss, and then creates an indicator vector is_missing that takes a 1 if the value is missing and a 0 otherwise. Since Stan won't accept NA values, I have given the missing values a placeholder value of -1.

library(ordinal)
library(rstan)
library(runjags)
library(rjags)
data(wine)

wine$new_rating <- ifelse(wine$rating > 3, 3, wine$rating)  # collapse 5-point rating to 3 categories
wine$new_rating_w_miss <- wine$new_rating
wine$temp_num <- ifelse(wine$temp == "cold", 1, 0)

set.seed(2017)
# indices of rows to set to missing (roughly 15% of the data);
# note: sample from 1:nrow(wine), since 0 is not a valid row index
miss_idx <- sample(1:nrow(wine), round(nrow(wine) * 0.15), replace = FALSE)

wine$new_rating_w_miss[ miss_idx ] <- -1
wine$is_missing <- ifelse(wine$new_rating_w_miss == -1, 1, 0)

Stan model

My Stan model code is as follows:

data {
  int K;                    // number of ordinal categories
  int<lower=0> N;           // number of observations
  int<lower=1> D;           // number of predictors
  int y[N];                 // ordinal response (-1 where missing)
  int<lower=0> missing[N];  // missingness indicator (1 = missing)
  int<lower=0> N_miss;      // number of missing responses
  row_vector[D] x[N];       // predictors
}

parameters {
  vector[D] beta;
  real<lower=0,upper=1> p;
  ordered[K-1] c;
}

model {
  vector[K] theta;
  beta ~ normal(0, 5);
  p ~ beta(2, 2);
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    theta[1] = Phi(c[1] - eta);
    for (k in 2:(K-1))
      theta[k] = Phi(c[k] - eta) - Phi(c[k-1] - eta);
    theta[K] = 1 - Phi(c[K-1] - eta);

    if (missing[n] == 0) {
      y[n] ~ categorical(theta);
      missing[n] ~ bernoulli(p);
    }

    if (missing[n] == 1) {
      target += log_mix(theta[1],
                        categorical_lpmf(1 | theta),
                        categorical_lpmf((2 || 3) | theta))
              + log_mix(theta[2],
                        categorical_lpmf(2 | theta),
                        categorical_lpmf((1 || 3) | theta))
              + log_mix(theta[3],
                        categorical_lpmf(3 | theta),
                        categorical_lpmf((1 || 2) | theta));
      missing[n] ~ bernoulli(p);
    }
  }
}

As can be seen, I have attempted to marginalise over the discrete unknown values using Stan's log_mix() function: adding up the (log) probabilities of the missing values being either 1, 2, or 3, weighted by the (log) probabilities of those values estimated from the observed data, together with the complementary (log) probabilities.

One question is: should I be estimating separate theta vectors of probabilities from the observed and unobserved data?

The results:

Inference for Stan model: Stan_missing_model.
4 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=1000.

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]   -1.28    0.01 0.30   -1.83   -1.48   -1.28   -1.09   -0.67   808 1.00
p          0.16    0.00 0.04    0.09    0.13    0.15    0.18    0.24   812 1.00
c[1]      -1.26    0.01 0.23   -1.70   -1.43   -1.26   -1.11   -0.81   699 1.00
c[2]      -0.72    0.01 0.21   -1.15   -0.86   -0.70   -0.57   -0.27   746 1.00
lp__    -119.80    0.08 1.47 -123.45 -120.57 -119.34 -118.71 -118.02   378 1.02

Running the model in R

Fitting the Stan model with the wine data prepared above:

stan_data <- list( K = 3, N = nrow(wine), D = 1, 
                   y = wine$new_rating_w_miss, 
                   missing = wine$is_missing, 
                   N_miss = sum(wine$is_missing),
                   x = as.matrix(wine$temp_num) 
                  )

fit_stan <- stan(file = "Stan_missing_model.stan", data = stan_data, 
                  cores = 4, chains = 4, warmup = 250, iter = 500)
print(fit_stan)

Estimating a similar model in JAGS

A similar model in JAGS (which imputes NA values automatically) seems to arrive at not-too-dissimilar parameter estimates:

model{
   for( i in 1:N ){
       y[i] ~ dcat( theta[i, 1:3] )
       theta[i,1] <- phi( c[1] - mu[i] )
       theta[i,2] <- phi( c[2] - mu[i] ) - phi( c[1] - mu[i] )
       theta[i,3] <- 1 - phi( c[2] - mu[i] )

       mu[i] <- beta * temp[i]
       miss[i] ~ dbern( p )
    }

  # JAGS parameterises dnorm by precision, so precision 1/5^2 = 0.04
  # corresponds to the Stan prior normal(0, 5)
  beta ~ dnorm(0, 0.04)
  p ~ dbeta(2, 2)
  for(i in 1:2){
    c_raw[i] ~ dnorm(0, 1.0E-3)  # vague prior on the raw cutpoints
  }
  c <- sort(c_raw)  # impose ordering on the cutpoints

}

jags_data <- list( N = nrow(wine), 
                   y = ifelse(wine$new_rating_w_miss == -1, 
                              NA, wine$new_rating_w_miss), 
                   temp = wine$temp_num, 
                   miss = wine$is_missing
                  )

jags_monitor <- c("beta","p","c")

fit_jags <- run.jags(model = "JAGS_so_model.txt", 
                    data = jags_data, monitor = jags_monitor,
                    inits = list("c_raw" = c(-0.5,0.5)),
                    n.chains = 4, burnin = 500, sample = 500)
summary(fit_jags)

         Lower95     Median    Upper95       Mean        SD Mode       MCerr MC%ofSD SSeff       AC.10      psrf
beta -1.40131934 -0.8531391 -0.3148070 -0.8516923 0.2783449   NA 0.022165423     8.0   158  0.05891468 1.0284938
p     0.07983462  0.1547419  0.2356164  0.1576307 0.0410555   NA 0.001264152     3.1  1055 -0.00592979 0.9989195
c[1] -2.64389817 -2.0139135 -1.4006353 -2.0246384 0.3322312   NA 0.024410375     7.3   185  0.08432516 1.0213586
c[2] -1.35215148 -0.9021336 -0.4709347 -0.9031153 0.2319716   NA 0.019547231     8.4   141  0.07087327 1.0153249
user_15

1 Answer


You need to code something equal to the log posterior plus a constant. That's usually the log joint for the model.
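For the model in the question that means, up to an additive constant,

target = log p(y_obs | beta, c) + log p(missing | p) + log p(beta) + log p(p),

with the missing responses marginalised out of the first term.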

Missing continuous data is typically modeled using a parameter---see the manual chapter on missing data.
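For reference, that pattern looks roughly like the following sketch of the continuous case (adapted from memory of the manual's example, not specific to the ordinal model here):

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];  // each missing value becomes a parameter
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);  // missing values modeled jointly with mu, sigma
}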

The || operation is logical or, so categorical_lpmf((2 || 3) | theta) is going to be equal to categorical_lpmf(1 | theta) because (2 || 3) evaluates to 1 (any non-zero value is treated as logically true and 1 encodes the true value).

If you have missing discrete data x that can take on values 1:K with log density/mass lp[k] when x == k, then the way to code it is just target += log_sum_exp(lp);
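For the ordinal model in the question, a minimal sketch of that marginalisation, replacing the log_mix() branch (with theta computed exactly as in the question's model block), would be something like:

if (missing[n] == 1) {
  vector[K] lp;
  for (k in 1:K)
    lp[k] = categorical_lpmf(k | theta);  // equals log(theta[k])
  target += log_sum_exp(lp);
  missing[n] ~ bernoulli(p);
}

And for the imputation the question asks about, a sketch of a generated quantities block that recomputes theta and draws the missing categories (y_imp is a name introduced here purely for illustration):

generated quantities {
  int y_imp[N];  // observed values passed through, missing values drawn
  for (n in 1:N) {
    if (missing[n] == 1) {
      vector[K] theta;
      real eta;
      eta = x[n] * beta;
      theta[1] = Phi(c[1] - eta);
      for (k in 2:(K-1))
        theta[k] = Phi(c[k] - eta) - Phi(c[k-1] - eta);
      theta[K] = 1 - Phi(c[K-1] - eta);
      y_imp[n] = categorical_rng(theta);
    } else {
      y_imp[n] = y[n];
    }
  }
}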

You need to make sure to include the normalizing term in the lp[k].

You might want to work through everything and see if the missing data has any impact on the posterior---it often just factors out.
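Concretely, when a missing y[n] enters the model only through y[n] ~ categorical(theta), each lp[k] = log(theta[k]), so

log_sum_exp(lp) = log(theta[1] + ... + theta[K]) = log(1) = 0,

and the marginalised term adds nothing to the target.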

You don't need separate parameters for observed and unobserved data.

Bob Carpenter
  • Right, thanks Bob. I think I understand, but still shaky on putting `lp[k]` into action. So when the data is missing, I need `P(x==k | theta)` over all k? In my code, I have `if(missing[n] == 1){ for(k in 1:K){ lp[k] = log(theta[k]) + categorical_lpmf(k | theta) } }`, then `target += log_sum_exp(lp)`. This runs and is similar to the JAGS output. I am unsure how to include the normalising constant. Am I right that the normalising constant is `P(x) = P(x=1:K) * P(x=1:K | theta)`? When I try to code `categorical_lpmf( 1 | theta)` in Stan I get `no matches for categorical_lpmf( int, vector)`. – user_15 Dec 12 '17 at 12:56
  • Sorry Bob, I don't get that error (I had misspelled `categorical_lpmf()`). In any case, I am still a bit lost in coding the normalising constant (my algebra is a bit rusty, hence my struggle wrapping my head around this). It's a great learning experience, however - much better than throwing it into JAGS and not thinking about it anymore. – user_15 Dec 12 '17 at 17:12
  • `categorical_lpmf(k | theta)` evaluates to `log(theta[k])`, so you shouldn't be doing that twice. What's the mixture model component? If it's a mixture of normals, that'd be `lp[k] = log(theta[k]) + normal_lpdf(y[n] | mu[k], sigma[k])`. And then the log-sum-exp as you have it. And you'd want to save `log(theta)` for efficiency. The no matches for vector is odd as you can definitely use `int` and `vector`; for example, `categorical_lpmf(1 | [0.2, 0.8]')` is fine (make sure you use the transpose operator). – Bob Carpenter Dec 13 '17 at 05:40
  • The error was due to a spelling mistake, so don't worry. The mixture component would be the vector `theta` I think, because `theta[1:K]` holds the associated probabilities of each possible ordinal category. I did find a Stan example of the Bones BUGS model which had missing observations for an ordered response variable; it didn't add any contribution of the missing observations to the log posterior - it just used `increment_log_prob` when the response was not missing. I think I'm having trouble because most examples in Stan marginalise out discrete _parameters_ but not missing discrete data. – user_15 Dec 13 '17 at 09:49
  • I'm also trying to generalise the approach discussed here https://groups.google.com/forum/#!topic/stan-users/9XEOnOiL99g to an ordinal response variable with 3 categories. – user_15 Dec 13 '17 at 10:43
  • The missing data are modeled with parameters. Each missing data item gets a parameter. So there's no difference between marginalizing out parameters or missing data. See the missing data chapter of the Stan manual. Also, `increment_log_prob(u)` is now `target += u`. – Bob Carpenter Dec 14 '17 at 22:38
  • Okay, thanks. I've been through the manual quite a bit. As far as I can tell, if the observed data `y` with discrete states `1:K` is modelled as `y ~ categorical(theta)`, with `theta` being a vector of length `K`, then the unobserved, missing data can be marginalised out by `for(k in 1:K) { lp[k] = log(theta[k]); } target += log_sum_exp(lp)`. – user_15 Dec 16 '17 at 11:55
  • That's right, but you'll see that it has no effect on the posterior as `log_sum_exp(lp)` will evaluate to 0, because `sum(theta)` evaluates to 1. That's what always happens when the variate is a leaf---you just generate it conditionally independently of everything else given `theta`. If you want `Pr[Y = k]` for some other inference, that's just `theta[k]`. – Bob Carpenter Dec 17 '17 at 15:28