
I'm using the caret package to predict a time series with method treebag. With this method, caret fits bagged regression trees using 25 bootstrap replications.

What I'm struggling to understand is how the final prediction of that 'treebag model' relates to the predictions made by each of the 25 trees, depending on whether I use caret::preProcess, or not.

I am aware of this question and the resources linked there, but I could not draw the right conclusions from them.

Here is an example using the economics data. Let's say I want to predict unemploy_rate, which has to be created first.

# packages
library(caret)
library(tidyverse)

# data
data("economics")

economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]

I wrote a function that extracts the 25 individual trees from the train object, makes a prediction for each tree, averages these 25 predictions, and compares this average with the prediction from the train object. It returns a plot.

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")

  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = x)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

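  # add prediction from train object, repeated once per tree
  # (after gather() the rows are stacked tree by tree, in date order)
  pred_trees$bagging_model_prediction <- rep(predict(model, x), times = length(bagged_trees))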
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
        geom_line(aes(y = value, group = tree), alpha = .2) +
        geom_line(aes(y = pred_value, col = model)) +
        theme_minimal() +
        theme(
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         legend.position = "bottom"
        )

  p

}

Now I estimate two models, the first will be unscaled, the second will be centered and scaled.

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

# estimate the models
models <- map(preproc_opts, function(preproc)
    train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
))

# apply predict_from_treebag to each model
imap(.x = models,
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

The results are shown below. The prediction of the unscaled model is the average of the 25 trees, as expected. But why is the prediction from each of the 25 trees a constant when I use preProcess?

Thank you for any advice on where I went wrong.

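[Two plots, one per model (titled "unscaled" and "scaled"): the 25 individual tree predictions over time, with the mean of the tree predictions and the prediction from the train object overlaid.]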

markus

1 Answer


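The problem is in this line of the function predict_from_treebag:

pred_trees <- map(bagged_trees, .f = predict, newdata = x)

Here predict dispatches to predict.rpart, because each extracted tree is an rpart object, as you can check with

class(bagged_trees[[1]])

predict.rpart does not know that caret pre-processed the data before fitting. The trees were grown on centered and scaled predictors, so they need centered and scaled predictors at prediction time as well. With the raw x, every row ends up in the same terminal node, which is why each tree returns a constant.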
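The effect can be reproduced without caret. The following is a minimal sketch on synthetic data (not the economics data); the names raw, scaled and fit are made up for illustration, and the manual call to scale() only stands in for what preProcess = c("center", "scale") does.

```r
library(rpart)

set.seed(1)

# synthetic data: a predictor on a large scale and a step-shaped response
raw <- data.frame(x1 = runif(200, min = 100, max = 200))
raw$y <- ifelse(raw$x1 > 150, 10, 0) + rnorm(200, sd = 0.1)

# center and scale the predictor by hand (stand-in for caret's preProcess)
scaled <- raw
scaled$x1 <- as.numeric(scale(raw$x1))

# the tree only ever sees the standardized predictor
fit <- rpart(y ~ x1, data = scaled)

# raw values lie far above every split threshold, so all rows take the same path
pred_raw <- predict(fit, newdata = raw)

# standardized values are spread across the terminal nodes
pred_scaled <- predict(fit, newdata = scaled)

length(unique(pred_raw))     # number of distinct predictions on raw input
length(unique(pred_scaled))  # number of distinct predictions on scaled input
```

On the raw input the tree produces a single distinct value, because split thresholds learned in standardized units (roughly between -2 and 2) are all below 100. The same thing happens to each of the 25 bagged trees in the question.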

Here is a quick fix:

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
  x <- economics[, -c(1, 7)]

  # apply the same pre-processing that train() used, if any
  newdata <- if (is.null(model$preProcess)) x else predict(model$preProcess, x)

  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

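  # add prediction from train object, repeated once per tree
  # (after gather() the rows are stacked tree by tree, in date order)
  pred_trees$bagging_model_prediction <- rep(predict(model, x), times = length(bagged_trees))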
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
    geom_line(aes(y = value, group = tree), alpha = .2) +
    geom_line(aes(y = pred_value, col = model)) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )

  p
}

Now after running:

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

models <- map(preproc_opts, function(preproc)
  train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
  ))

map2(.x = models,
     .y = names(models),
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

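the results are as expected: for both models, the individual trees vary over time and the prediction from the train object matches the average of the 25 tree predictions.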
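If you prefer a numeric check over a plot, something along these lines should work. It is only a sketch: it assumes the mtrees / btree structure used in the question, and the names x_pp, trees and tree_preds are made up for illustration.

```r
library(caret)
library(ggplot2)  # provides the economics data

data("economics")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- as.data.frame(economics[, -c(1, 7)])
y <- economics[["unemploy_rate"]]

set.seed(42)
model <- train(
  x = x,
  y = y,
  method = "treebag",
  trControl = trainControl(method = "none"),
  preProcess = c("center", "scale")
)

# pre-process the predictors exactly as train() did
x_pp <- predict(model$preProcess, x)

# pull out the individual rpart trees and predict with each of them
trees <- lapply(model$finalModel$mtrees, function(m) m$btree)
tree_preds <- sapply(trees, predict, newdata = x_pp)  # one column per tree

# compare the row-wise average of the trees with caret's own prediction
all.equal(unname(rowMeans(tree_preds)), unname(predict(model, x)))
```

If the average of the pre-processed tree predictions matches the prediction from the train object, all.equal() returns TRUE. Passing x instead of x_pp to the individual trees is what produced the constant lines in the question.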

[Two plots, one per model (titled "unscaled" and "scaled"), produced with the corrected function.]

missuse
    @markus Glad to help. I added an updated `predict_from_treebag` function that will work on any `preProcess` called within train. – missuse Jan 05 '18 at 12:49