I'm looking for an efficient method to identify data points that have an outsize effect on the parameters of a linear model. This is straightforward with ordinary linear models, but I'm not sure how to do it with Bayesian linear models.
Here's one way with ordinary linear models: we can compute the Cook's distance for each data point, and draw diagnostic plots that include the Cook's distances:
# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# make linear models for each species
xy <-
  iris %>%
  nest(-Species) %>%
  mutate(model = map(data,
                     ~lm(Sepal.Length ~ Petal.Width,
                         data = .)),
         fit = map(model, augment))
Here we have a data frame with nested list-columns; the model column contains the linear model for each species:
> xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3 virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
The broom::augment function lets us add the Cook's distance values for each data point to this data frame, and we can inspect them like this:
# inspect Cook's distance values
xy %>%
  unnest(fit) %>%
  arrange(desc(.cooksd))
# A tibble: 150 × 10
Species Sepal.Length Petal.Width .fitted .se.fit .resid .hat .sigma .cooksd
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 versicolor 5.9 1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2 setosa 5.0 0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3 virginica 4.9 1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4 setosa 5.7 0.4 5.149247 0.08625887 0.5507534 0.06357957 0.3355980 0.09396588
5 setosa 4.3 0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6 virginica 5.8 2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7 virginica 7.2 1.6 6.310746 0.16207266 0.8892538 0.06909799 0.6084108 0.08293253
8 versicolor 4.9 1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9 setosa 5.8 0.2 4.963212 0.05287342 0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor 6.0 1.0 5.471005 0.11998077 0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>
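To actually flag the high-influence points, I filter on a conventional cutoff; 4/n is the rule of thumb I use here (other cutoffs exist):

# flag points above the 4/n rule-of-thumb cutoff, per species
xy %>%
  unnest(fit) %>%
  group_by(Species) %>%
  filter(.cooksd > 4 / n()) %>%
  arrange(desc(.cooksd))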
And with the autoplot method, we can make informative diagnostic plots that show the Cook's distance values and help us quickly identify data points with an outsize effect on the models' parameters:
# plot model diagnostics
library(ggplot2)
library(ggfortify)
diagnostic_plots_df <-
  xy %>%
  mutate(diagnostic_plots = map(model,
                                ~autoplot(.,
                                          which = 1:6,
                                          ncol = 3,
                                          label.size = 3)))
Here's just one of the plots produced:
> diagnostic_plots_df$diagnostic_plots[[1]]
Now, with rstanarm we can similarly fit a Bayesian linear model for each group in the data frame:
# Bayesian linear model diagnostics
library(rstanarm)
bayes_xy <-
  iris %>%
  nest(-Species) %>%
  mutate(model = map(data,
                     ~stan_lm(Sepal.Length ~ Petal.Width,
                              data = .,
                              prior = NULL,
                              chains = 1,
                              cores = 2,
                              seed = 1)),
         fit = map(model, augment))
> bayes_xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3 virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
But the broom::augment output doesn't include anything like a Cook's distance value:
# inspect fit diagnostics
bayes_xy %>% unnest(fit)
# A tibble: 150 × 6
Species Sepal.Length Petal.Width .fitted .se.fit .resid
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 0.2 4.963968 0.06020298 0.13482025
2 setosa 4.9 0.2 4.963968 0.06020298 -0.06517975
3 setosa 4.7 0.2 4.963968 0.06020298 -0.26517975
4 setosa 4.6 0.2 4.963968 0.06020298 -0.36517975
5 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
6 setosa 5.4 0.4 5.151501 0.11299956 0.21818386
7 setosa 4.6 0.3 5.057734 0.05951488 -0.47349794
8 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
9 setosa 4.4 0.2 4.963968 0.06020298 -0.56517975
10 setosa 4.9 0.1 4.870201 0.11408783 0.04313845
# ... with 140 more rows
And there's no autoplot method:
# plot model diagnostics
bayes_diagnostic_plots_df <-
  bayes_xy %>%
  mutate(diagnostic_plots = map(model,
                                ~autoplot(.,
                                          which = 1:6,
                                          ncol = 3,
                                          label.size = 3)))
# Error: there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])
# This is quite interesting, but nothing at the level of specific data points
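The closest thing to per-data-point output I've been able to pull straight from the stanreg object is the pointwise log-likelihood matrix (via rstanarm's log_lik method), along with the raw posterior draws. I assume these are the raw material that any influence measure would be built from, but I don't know how to turn them into something like a Cook's distance:

# per-observation log-likelihoods: one row per posterior draw,
# one column per data point
ll <- log_lik(bayes_xy$model[[1]])
dim(ll)

# raw posterior draws of the parameters
draws <- as.matrix(bayes_xy$model[[1]])
head(colnames(draws))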
Some of the scholarly literature discusses methods such as model perturbations, phi-divergence, Cook's posterior mode and posterior mean distances, Kullback-Leibler divergence, and so on. But I can't find anywhere that these have been implemented in R code, and I'm stuck.
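The closest I've come myself is a brute-force sketch of something like the Cook's posterior mean distance: refit the model with each observation left out and measure how far the posterior point estimates move, scaled by their posterior covariance. It needs one full MCMC refit per data point, so it's far too slow for my real data, and I'm not sure the scaling is even right, but it shows the kind of per-observation output I'm after (just for setosa, and assuming coef() and vcov() on the stanreg fit give the posterior point estimates and their covariance):

# crude brute-force sketch of a "Cook's posterior mean distance":
# refit without each observation in turn and measure the shift in the
# posterior point estimates, scaled by their posterior covariance
library(rstanarm)

setosa    <- subset(iris, Species == "setosa")
full_fit  <- stan_lm(Sepal.Length ~ Petal.Width, data = setosa,
                     prior = NULL, chains = 1, seed = 1)
beta_full <- coef(full_fit)   # posterior point estimates of the coefficients
V_full    <- vcov(full_fit)   # posterior covariance of the coefficients

cooks_post_mean <- vapply(seq_len(nrow(setosa)), function(i) {
  # one full MCMC refit per left-out observation, so this is very slow
  fit_i <- stan_lm(Sepal.Length ~ Petal.Width, data = setosa[-i, ],
                   prior = NULL, chains = 1, seed = 1)
  d <- coef(fit_i) - beta_full
  drop(t(d) %*% solve(V_full) %*% d) / length(d)   # Mahalanobis-style shift
}, numeric(1))

# observations that move the posterior the most
head(order(cooks_post_mean, decreasing = TRUE))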
There is an unanswered question on this topic at Cross Validated. I'm posting here because I'm looking for ideas about writing code to compute influence statistics (rather than advice about the statistical theory and approach, which belongs at that other question).
How can I get something like a Cook's distance measure from the output of rstanarm::stan_lm?