I have some ordinal data with missingness, which I am trying to model in Stan. Since Stan cannot handle discrete parameters directly, I am attempting to marginalize over the different possible values of the response variable for those cases which are missing.
Intuitively, I believe that I need the probability of a missing value being each ordinal outcome, multiplied by the probability of that outcome, plus the probability of it not being that outcome, multiplied by one minus the probability of that outcome.
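In symbols, for a missing $y_n$ with $K$ ordered categories and model-implied category probabilities $\theta_{nk}$, I think the contribution I am after is roughly
$$\sum_{k=1}^{K}\Big[\Pr(y_n = k)\,\theta_{nk} \;+\; \Pr(y_n \neq k)\,(1-\theta_{nk})\Big],$$
evaluated on the log scale, though I may well have this wrong, which is part of what I am asking.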
However, in practice I am uncertain whether I am coding the model correctly in Stan, and whether my mathematical intuition for marginalising over the missing data is right. Although the Stan manual has examples of dealing with discrete parameters, I'm still a bit lost. I would also like to make predictions, or obtain an imputed vector of y values, in the generated quantities block (a rough sketch of what I have in mind is further down), but I could do with some pointers on getting started. The problem is not getting code running, but rather sanity checking that I have done it correctly.
Example
As an example, consider the wine data from the R package ordinal. The R code below takes the rating variable, transforms it to a 3-category scale for ease of explication, induces 15% missingness in the variable new_rating_w_miss, and then creates a vector is_missing that takes a 1 if the value is missing and a 0 otherwise. Since Stan won't take NA values, I have given the missing values a value of -1.
library(ordinal)
library(rstan)
library(runjags)
library(rjags)
data(wine)
wine$new_rating <- ifelse( wine$rating > 3, 3, wine$rating)
wine$new_rating_w_miss <- ifelse( wine$rating > 3, 3, wine$rating)
wine$temp_num <- ifelse(wine$temp == "cold", 1, 0)
set.seed(2017)
n_missing <- sample(1:nrow(wine), nrow(wine)*0.15, replace=FALSE)
wine$new_rating_w_miss[ n_missing ] <- -1
wine$is_missing <- ifelse(wine$new_rating_w_miss == -1, 1, 0 )
Stan model
My Stan model code is as follows:
data {
  int K;
  int<lower=0> N;
  int<lower=1> D;
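  // the outcome; -1 is used as a placeholder where the rating is missing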
  int y[N];
  int<lower=0> missing[N];
  int<lower=0> N_miss;
  row_vector[D] x[N];
}
parameters {
  vector[D] beta;
  real<lower=0,upper=1> p;
  ordered[K-1] c;
}
model {
  vector[K] theta;
  vector[N_miss] lp;
  beta ~ normal(0, 5);
  p ~ beta(2, 2);
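  // loop over observations: build the ordered-probit category probabilities theta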
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    theta[1] = Phi(c[1] - eta);
    for (k in 2:(K-1))
      theta[k] = Phi(c[k] - eta) - Phi(c[k-1] - eta);
    theta[K] = 1 - Phi(c[K-1] - eta);
    if (missing[n] == 0) {
      y[n] ~ categorical(theta);
      missing[n] ~ bernoulli(p);
    }
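    // missing outcome: my attempt at marginalising over the possible categories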
    if (missing[n] == 1) {
      target += log_mix(theta[1],
                        categorical_lpmf(1 | theta),
                        categorical_lpmf((2 || 3) | theta))
                +
                log_mix(theta[2],
                        categorical_lpmf(2 | theta),
                        categorical_lpmf((1 || 3) | theta))
                +
                log_mix(theta[3],
                        categorical_lpmf(3 | theta),
                        categorical_lpmf((1 || 2) | theta));
      missing[n] ~ bernoulli(p);
    }
  }
}
As can be seen, I have attempted to marginalise over the discrete unknown values using Stan's log_mix() function: for each missing observation I add up the (log) probabilities of it being a 1, 2, or 3, weighted by the (log) probability of each of those values estimated from the observed data, together with the counter-(log) probability.
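For reference, my understanding from the Stan functions reference is that log_mix(lambda, lp1, lp2) is defined as log_sum_exp(log(lambda) + lp1, log1m(lambda) + lp2), so each log_mix() call above is a two-component mixture with mixing weight theta[k].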
One question is: should I be estimating separate theta vectors of probabilities from the observed and unobserved data?
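For the generated quantities block mentioned at the top, my rough sketch so far is below, assuming it is sensible to simply re-compute theta for each posterior draw and use categorical_rng() for the missing cases; the combined observed/imputed vector is called y_impute here, and I have not convinced myself this is the right way to get predictions or imputations:
generated quantities {
  int y_impute[N];
  for (n in 1:N) {
    vector[K] theta;
    real eta;
    eta = x[n] * beta;
    theta[1] = Phi(c[1] - eta);
    for (k in 2:(K-1))
      theta[k] = Phi(c[k] - eta) - Phi(c[k-1] - eta);
    theta[K] = 1 - Phi(c[K-1] - eta);
    if (missing[n] == 0)
      y_impute[n] = y[n];                   // keep the observed rating
    else
      y_impute[n] = categorical_rng(theta); // draw an imputed rating
  }
}
This duplicates the theta calculation from the model block, which I assume is unavoidable unless I move it into transformed parameters.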
The results:
Inference for Stan model: Stan_missing_model.
4 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=1000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] -1.28 0.01 0.30 -1.83 -1.48 -1.28 -1.09 -0.67 808 1.00
p 0.16 0.00 0.04 0.09 0.13 0.15 0.18 0.24 812 1.00
c[1] -1.26 0.01 0.23 -1.70 -1.43 -1.26 -1.11 -0.81 699 1.00
c[2] -0.72 0.01 0.21 -1.15 -0.86 -0.70 -0.57 -0.27 746 1.00
lp__ -119.80 0.08 1.47 -123.45 -120.57 -119.34 -118.71 -118.02 378 1.02
Running the model in R
Running the model in R with the wine data above:
stan_data <- list( K = 3, N = nrow(wine), D = 1,
y = wine$new_rating_w_miss,
missing = wine$is_missing,
N_miss = sum(wine$is_missing),
x = as.matrix(wine$temp_num)
)
fit_stan <- stan(file = "Stan_missing_model.stan", data = stan_data,
cores = 4, chains = 4, warmup = 250, iter = 500)
print(fit_stan)
Estimating a similar model in JAGS
A similar model in JAGS, which automatically imputes NA values, seems to arrive at not-too-dissimilar parameter estimates:
model{
  for( i in 1:N ){
    y[i] ~ dcat( theta[i, 1:3] )
    theta[i,1] <- phi( c[1] - mu[i] )
    theta[i,2] <- phi( c[2] - mu[i] ) - phi( c[1] - mu[i] )
    theta[i,3] <- 1 - phi( c[2] - mu[i] )
    mu[i] <- beta * temp[i]
    miss[i] ~ dbern( p )
  }
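  # note: dnorm() in JAGS is parameterised by precision rather than SD,
  # so this is not on the same scale as the normal(0, 5) prior in the Stan model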
  beta ~ dnorm(0, 5)
  p ~ dbeta(2, 2)
  for(i in 1:2){
    c_raw[i] ~ dnorm(0, 1.0E-3)
  }
  c <- sort(c_raw)
}
jags_data <- list( N = nrow(wine),
y = ifelse(wine$new_rating_w_miss == -1,
NA, wine$new_rating_w_miss),
temp = wine$temp_num,
miss = wine$is_missing
)
jags_monitor <- c("beta","p","c")
fit_jags <- run.jags(model = "JAGS_so_model.txt",
data = jags_data, monitor = jags_monitor,
inits = list("c_raw" = c(-0.5,0.5)),
n.chains = 4, burnin = 500, sample = 500)
summary(fit_jags)
Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
beta -1.40131934 -0.8531391 -0.3148070 -0.8516923 0.2783449 NA 0.022165423 8.0 158 0.05891468 1.0284938
p 0.07983462 0.1547419 0.2356164 0.1576307 0.0410555 NA 0.001264152 3.1 1055 -0.00592979 0.9989195
c[1] -2.64389817 -2.0139135 -1.4006353 -2.0246384 0.3322312 NA 0.024410375 7.3 185 0.08432516 1.0213586
c[2] -1.35215148 -0.9021336 -0.4709347 -0.9031153 0.2319716 NA 0.019547231 8.4 141 0.07087327 1.0153249