linear model diagnostics for Bayesian models using rstan

Question

I'm looking for an efficient method to identify data points that have an outsize effect on the parameters of a linear model. This is straight-forward with ordinary linear models, but I'm not sure how to do it with Bayesian linear models.

Here's one way with ordinary linear models, we can compute the Cook's distance for each data point, and plot diagnostic plots that include Cook's distances:

# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
# make linear models for each type of species
xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~lm(Sepal.Length ~ Petal.Width, 
                         data = .)),
         fit = map(model, augment))

Here we have a data frame with nested lists, the model column contains the linear model for each species:

> xy
# A tibble: 3 × 4
     Species              data    model                   fit
      <fctr>            <list>   <list>                <list>
1     setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3  virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>

The broom::augment function allowed us to add the Cook's distance values for each data point into this data frame, and we can inspect them like this:

# inspect Cook's distance values
xy %>% 
 unnest(fit) %>% 
 arrange(desc(.cooksd))

  # A tibble: 150 × 10
      Species Sepal.Length Petal.Width  .fitted    .se.fit     .resid       .hat    .sigma    .cooksd
       <fctr>        <dbl>       <dbl>    <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
1  versicolor          5.9         1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2      setosa          5.0         0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3   virginica          4.9         1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4      setosa          5.7         0.4 5.149247 0.08625887  0.5507534 0.06357957 0.3355980 0.09396588
5      setosa          4.3         0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6   virginica          5.8         2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7   virginica          7.2         1.6 6.310746 0.16207266  0.8892538 0.06909799 0.6084108 0.08293253
8  versicolor          4.9         1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9      setosa          5.8         0.2 4.963212 0.05287342  0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor          6.0         1.0 5.471005 0.11998077  0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>

And with the autoplot method, we can make informative diagnostic plots that show Cook's distance values, and help us to quickly identify data points with an outsize effect on the models' paramters:

# plot model diagnostics
library(ggplot2)
library(ggfortify)
diagnostic_plots_df <- 
  xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

Here's just one of the plots produced:

> diagnostic_plots_df[[1]]

Now, with the Bayesian linear model we can similarly compute linear models for each group in the data frame:

# Bayesian linear model diagnostics
library(rstanarm)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)),
         fit =  map(model, augment))

> bayes_xy
# A tibble: 3 × 4
     Species              data         model                   fit
      <fctr>            <list>        <list>                <list>
1     setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3  virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>

But the broom::augment method doesn't have anything like a Cook's distance value:

# inspect fit diagnostics
bayes_xy %>% unnest(fit)

# A tibble: 150 × 6
   Species Sepal.Length Petal.Width  .fitted    .se.fit      .resid
    <fctr>        <dbl>       <dbl>    <dbl>      <dbl>       <dbl>
1   setosa          5.1         0.2 4.963968 0.06020298  0.13482025
2   setosa          4.9         0.2 4.963968 0.06020298 -0.06517975
3   setosa          4.7         0.2 4.963968 0.06020298 -0.26517975
4   setosa          4.6         0.2 4.963968 0.06020298 -0.36517975
5   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
6   setosa          5.4         0.4 5.151501 0.11299956  0.21818386
7   setosa          4.6         0.3 5.057734 0.05951488 -0.47349794
8   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
9   setosa          4.4         0.2 4.963968 0.06020298 -0.56517975
10  setosa          4.9         0.1 4.870201 0.11408783  0.04313845
# ... with 140 more rows

And no autoplot method:

# plot model diagnostics
bayes_diagnostic_plots_df <- 
  bayes_xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

# Error, there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])

# This is quite interesting, but nothing at the level of specific data points

Some of the scholarly literature talks about methods such as model perturbations, phi-divergence, Cook's posterior mode distance and Cook's posterior mean distance, Kullback-Leibler divergence, etc., etc.. But I can't see anywhere this has been explored with R code, and I'm stuck.

There is an unanswered question on this topic at Cross-validated. I'm posting here because I'm looking for ideas about writing code to compute influence statistics (rather than advice about the statistical theory and approach, etc., which should go at that other question)

How can I get something like a Cook's Distance measure from the output of rstanarm::stan_lm?

score 4 · Answer 1 · answered Sep 20 '16 at 13:26

This post said by Aki Vehtari said it best:

The difference between lppd_i and loo_i has been used as a sensitivity measure (see, e.g., Gelfand et al 1992). Pareto shape parameter estimate k is likely to be large if the difference between lppd_i and loo_i is large. It's not yet clear to me whether the Pareto shape parameter estimate k would be better than lppd_i-loo_i, but at least we know that estimate for lppd_i-loo_i is too small if k is close to 1 or larger, so it might be better to look at k. In stack example with normal model, k for one observation is large, but with student-t model k is smaller. Normal model is same as student-t model, but with very strong prior on degrees of freedom. So it's not just about having strong prior or more shrinkage, but having a model which can describe the observations well. With increased shrinkage and non-robust observation model, the one observation could still be surprising. Naturally it's not always the best solution to change to a more robust observation model allowing "outliers". Instead it might be better to make the regression function more nonlinear (that is having a less strong prior), or transform covariates, or add more covariates. So I do recommend looking at Pareto shape parameter values, but I don't recommend to increase shrinkage if the values are large.

You can obtain the Pareto shape parameter estimate k from the $pareto_k element of the list produced by the loo function in the loo package, which is reexported by the rstanarm package. If this value is higher than 0.7 (by default), the loo function will recommend that you re-fit the model leaving out this observation because the posterior distribution is likely to be too sensitive to this observation to satisfy the assumption of the LOOIC that each observation has a negligible impact on the posterior distribution.

In the OP's case, only the seventh observation has a Pareto shape parameter estimate that is slightly greater than 0.5, so that observation probably does not have an extreme effect on the posterior. But you definitely want to investigate observations that have a value greater than 1.0

You can also call the plot method for a loo object, especially with the non-default option label_points = TRUE to visualize the Pareto shape parameter estimates.

score 1 · Answer 2 · answered Sep 19 '16 at 20:27

Looking into some of the discussion on the stan-users email list, I saw that output from the loo package, contains 'pointwise contributions for each observation'. So here's an attempt at working with those:

# Bayesian linear model diagnostics
library(rstanarm)
library(loo)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)))


bayes_xy_loo <- 
bayes_xy %>% 
  mutate(loo_out = map(model, ~loo(.)))

library(ggplot2)
library(ggrepel)
n <-  5 # how many points to label


my_plots <- 
bayes_xy_loo %>% 
  select(loo_out) %>% 
  mutate(loo_pointwise = map(.$loo_out, ~data.frame(.$pointwise))) %>% 
  mutate(plots = map(.$loo_pointwise, 
      ~ggplot(., 
       aes(elpd_loo,
           looic)) +
       geom_point(aes(size = p_loo)) +
       geom_text_repel(data = .[tail(order(.$looic), n),] ,
                  aes(label = row.names(.[tail(order(.$looic), n),])),
                  nudge_x = -0.1,
                  nudge_y = -0.3) +
        geom_label_repel(data = .[tail(order(.$elpd_loo), n),] ,
                        aes(label = row.names(.[tail(order(.$elpd_loo), n),])),
                        nudge_x = 0.1,
                        nudge_y = 0.1) +
       xlab("Expected log pointwise \npredictive density") +
       ylab("LOO information \ncriterion") +
       scale_size_area(name = "Estimated \neffective\nnumber \nof parameters") +
       theme_minimal(base_size = 10)))

do.call(gridExtra::grid.arrange, my_plots$plots)

However, the points suggested to be influential are not a good match. For example, in the question we have obs. 7, 15, and 30 with high Cook's Distance values. In the loo output, it seems that only obs. 15 is identified as usual. So perhaps this isn't the right way to do it.

The thread on stan-users refers to the estimated shape parameter of the generalized Pareto distribution as being a measure of how much the posterior distribution would change if a single observation were omitted. This can be plotted with `par(mfrow = c(3,1), mar = c(5,4,1,1) + .1); sapply(bayes_xy_loo$loo_out, FUN = plot, label_points = TRUE)` Observation 7 in the third species is singled out as a little high, although 15 and 30 are not. In general, I would expect this to be less sensitive than OLS to observations with high Cook's distance due to the fairly strong default priors in rstanarm. — Ben Goodrich, Sep 19 '16 at 20:53
Thanks for taking a look. Do I understand you correctly that I should be looking at the `bayes_xy_loo$loo_out$pareto_k` output for information about the importance of individual data points in the model? — Ben, Sep 19 '16 at 21:51
Yes, albeit with an index for the model in `bayes_xy_loo$loo_out` before extracting the `pareto_k` element. — Ben Goodrich, Sep 19 '16 at 22:27
Ok, thanks very much. I've just had [a chat with one of the stan maintainers](https://github.com/dgrtwo/broom/issues/162#issuecomment-248142465) and they've said the same thing. Do you want to add that as an answer? — Ben, Sep 19 '16 at 22:30
Ben Goodrich is one of the core developers for Stan, RStan, and RStanArm. We usually don't concentrate on evaluating point estimates in a Bayesian framework, which is what you seem to be trying to do. So we'd probably want to get a posterior over any evaluation statistics. One thing you could do is leave-one-out prediction of the data points and see which ones aren't well predicted by the model based on the other data points---it's a kind of posterior predictive check. — Bob Carpenter, Sep 20 '16 at 06:00

linear model diagnostics for Bayesian models using rstan

2 Answers2