Question:
What factors can cause prediction intervals to have higher empirical coverage than their nominal level, particularly for quantile regression forests fit with the ranger package?
Specific Context + REPREX:
I am using quantile regression forests through parsnip and the tidymodels suite of packages, with ranger as the engine, to generate prediction intervals. While reviewing an example using the ames housing data, I was surprised to see that my 90% prediction intervals had an empirical coverage of ~97% when evaluated on a hold-out dataset (coverage on the training data was even higher).
This was even more surprising given that my model performance is substantially worse on the hold-out set than on the training set, so I would have guessed that coverage would be lower than expected, not higher.
Load libraries, data, set-up split:
```{r}
library(tidyverse)
library(tidymodels)
library(AmesHousing)
ames <- make_ames() %>%
  mutate(Years_Old = Year_Sold - Year_Built,
         Years_Old = ifelse(Years_Old < 0, 0, Years_Old))
set.seed(4595)
# Spell out `prop`: the abbreviation `p` relies on partial argument matching
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
```
Specify model workflow:
```{r}
rf_recipe <-
  recipe(
    Sale_Price ~ Lot_Area + Neighborhood + Years_Old + Gr_Liv_Area + Overall_Qual + Total_Bsmt_SF + Garage_Area,
    data = ames_train
  ) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, Overall_Qual, threshold = 50) %>%
  step_novel(Neighborhood, Overall_Qual) %>%
  step_dummy(Neighborhood, Overall_Qual)

rf_mod <- rand_forest() %>%
  set_engine("ranger", importance = "impurity", seed = 63233, quantreg = TRUE) %>%
  set_mode("regression")

set.seed(63233)
rf_wf <- workflows::workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe) %>%
  fit(ames_train)
```
Make predictions on training and hold-out datasets:
```{r}
# rf_wf$fit$fit$fit is the underlying ranger object, so predict() here
# dispatches to ranger's own method and accepts type = "quantiles"
rf_preds_train <- predict(
  rf_wf$fit$fit$fit,
  workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_train),
  type = "quantiles",
  quantiles = c(0.05, 0.50, 0.95)
) %>%
  with(predictions) %>%
  as_tibble() %>%
  set_names(paste0(".pred", c("_lower", "", "_upper"))) %>%
  # Back-transform from the log10 scale to the original Sale_Price scale
  mutate(across(contains(".pred"), ~10^.x)) %>%
  bind_cols(ames_train)

# Same steps for the hold-out data
rf_preds_test <- predict(
  rf_wf$fit$fit$fit,
  workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_test),
  type = "quantiles",
  quantiles = c(0.05, 0.50, 0.95)
) %>%
  with(predictions) %>%
  as_tibble() %>%
  set_names(paste0(".pred", c("_lower", "", "_upper"))) %>%
  mutate(across(contains(".pred"), ~10^.x)) %>%
  bind_cols(ames_test)
```
The coverage rate is well above the expected 90% for both the training and hold-out data (empirically ~98% and ~97%, respectively):
```{r}
rf_preds_train %>%
  mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>%
  summarise(n = n(),
            n_covered = sum(covered),
            covered_prop = n_covered / n,
            stderror = sd(covered) / sqrt(n)) %>%
  mutate(min_coverage = covered_prop - 2 * stderror,
         max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
#       n n_covered covered_prop stderror min_coverage max_coverage
#   <int>     <dbl>        <dbl>    <dbl>        <dbl>        <dbl>
# 1  2199      2159        0.982  0.00285        0.976        0.988

rf_preds_test %>%
  mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>%
  summarise(n = n(),
            n_covered = sum(covered),
            covered_prop = n_covered / n,
            stderror = sd(covered) / sqrt(n)) %>%
  mutate(min_coverage = covered_prop - 2 * stderror,
         max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
#       n n_covered covered_prop stderror min_coverage max_coverage
#   <int>     <dbl>        <dbl>    <dbl>        <dbl>        <dbl>
# 1   731       706        0.966  0.00673        0.952        0.979
```
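It may also help to split the misses by side, since a 5%-95% interval should nominally miss about 5% of observations below and 5% above. The sketch below is only an illustration in base R: `interval_coverage()` is a hypothetical helper, not part of the reprex or of any package, and the check uses simulated standard-normal data with the true quantiles rather than the ames predictions.

```r
# Hypothetical helper (not part of the original reprex): summarise how often
# observed values fall inside, below, or above a prediction interval.
interval_coverage <- function(truth, lower, upper) {
  stopifnot(length(truth) == length(lower), length(truth) == length(upper))
  n <- length(truth)
  covered <- truth >= lower & truth <= upper
  p <- mean(covered)
  data.frame(
    n = n,
    coverage = p,
    below_lower = mean(truth < lower),  # nominally 0.05 for a 5%-95% interval
    above_upper = mean(truth > upper),  # nominally 0.05 for a 5%-95% interval
    std_error = sqrt(p * (1 - p) / n)   # binomial standard error of coverage
  )
}

# Simulated sanity check (assumption: standard normal outcomes paired with the
# true 5% and 95% quantiles), so coverage should land near the nominal 0.90.
set.seed(1)
n_sim <- 5000
y <- rnorm(n_sim)
res <- interval_coverage(y, rep(qnorm(0.05), n_sim), rep(qnorm(0.95), n_sim))
print(res)
```

With the objects from the question, the equivalent call would be `interval_coverage(rf_preds_test$Sale_Price, rf_preds_test$.pred_lower, rf_preds_test$.pred_upper)`. If the over-coverage were concentrated on one side, that would point to a different cause than intervals that are simply too wide on both sides.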
Guesses:
- Something about the ranger package or quantile regression forests is overly extreme in the way it estimates quantiles, or I am overfitting in the 'extreme' direction somehow -- leading to my highly conservative prediction intervals
- This is a quirk specific to this dataset / model
- I am missing something or setting-up something incorrectly
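
One setup concern that can probably be ruled out is the back-transformation. Because `10^x` is monotone increasing, an observation falls inside an interval on the log10 scale exactly when it falls inside the back-transformed interval on the original scale. The sketch below illustrates this with simulated values only; the numbers are made up for illustration and are not taken from the ames data.

```r
# Illustration with simulated values (assumption: not the ames data).
set.seed(2)
n_sim <- 1000
log_price <- rnorm(n_sim, mean = 5.2, sd = 0.2)  # simulated log10 sale prices
log_lower <- rep(5.0, n_sim)                     # hypothetical lower bound, log10 scale
log_upper <- rep(5.5, n_sim)                     # hypothetical upper bound, log10 scale

# Coverage indicator computed on the log10 scale
covered_log <- log_price >= log_lower & log_price <= log_upper

# Coverage indicator computed after back-transforming everything with 10^x
covered_raw <- 10^log_price >= 10^log_lower & 10^log_price <= 10^log_upper

# The two indicators agree observation by observation (barring floating-point
# ties exactly at a bound), so the empirical coverage is the same on both scales.
same_coverage <- identical(covered_log, covered_raw)
print(same_coverage)
print(mean(covered_raw))
```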