Tidymodels Package: Visualising a random forest model using ggplot() to show the most important predictors

Question

Overview

I am following a tutorial (see below) to find the best fit models from bagged trees, random forests, boosted trees, and general linear models.

Tutorial (see examples below)

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

Issue

In this case, I would like to explore the data further and visualise the most important predictors (see diagram below) for my data in the random forest model.

My data frame is called FID and the predictors in the random forest model involve:

Year (numeric)
Month (Factor)
Days (numeric)

The dependent variable is Frequency (numeric)

When I try to run the plot to visualise the most important predictor, I keep on getting this error message:-

Error: Problem with `mutate()` input `oob_rmse`.
x non-numeric argument to mathematical function
ℹ Input `oob_rmse` is `map_dbl(fit, ~sqrt(.x$prediction.error))`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)

If anyone has any advice on how to fix the error message, I would be deeply appreciative.

Many thanks in advance

Examples of how to produce the plot from the R-code in the tutorial

Visualise the model

Plot to show the most important predictors from the R-code in the tutorial

My R-code

##Open libraries
library(tidymodels)
library(parsnip)
library(forcats)
library(ranger)
library(baguette)

###########################################################
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

 # resample the data with 10-fold cross-validation (10-fold by default)
  cv <- vfold_cv(train_data, v=3)
###########################################################

##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###################################################################################


    ###################################################
    ##Random forests
    ###################################################
    
    mod_rf <-rand_forest(trees = 1e3) %>%
                                  set_engine("ranger",
                                  num.threads = parallel::detectCores(), 
                                  importance = "permutation", 
                                  verbose = TRUE) %>% 
                                  set_mode("regression") 
                                  
    ##Create Workflow
    
    wflow_rf <- workflow() %>% 
                   add_model(mod_rf) %>% 
                         add_recipe(rec)
    
    ##Fit the model
    
    plan(multisession)
    
    fit_rf<-fit_resamples(
                 wflow_rf,
                 cv,
                 metrics = metric_set(rmse, rsq),
                 control = control_resamples(save_pred = TRUE,
                                             extract = function(x) extract_model(x)))
    
    
    # extract roots
    rf_tree_roots <- function(x){
                         map_chr(1:1000, 
                            ~ranger::treeInfo(x, tree = .)[1, "splitvarName"])
                                }
    
    rf_roots <- function(x){
                           x %>% 
                            dplyr::select(.extracts) %>% 
                            unnest(cols = c(.extracts)) %>% 
                            dplyr::mutate(fit = map(.extracts,
                            ~.x$fit$fit$fit),
                            oob_rmse = map_dbl(fit,
                                  ~sqrt(.x$prediction.error)),
                             roots = map(fit, 
                            ~rf_tree_roots(.))
                                   ) %>% 
                            dplyr::select(roots) %>% 
                            unnest(cols = c(roots))
                            }
    
    ##Open a plotting window
    dev.new()
    
    # plot
    rf_roots(fit_rf) %>% 
                group_by(roots) %>% 
                count() %>% 
                dplyr::arrange(desc(n)) %>% 
                dplyr::filter(n > 75) %>% 
                ggplot(aes(fct_reorder(roots, n), n)) +
                geom_col() + 
                coord_flip() + 
                labs(x = "root", y = "count")

##Error message

Error: Problem with `mutate()` input `oob_rmse`.
x non-numeric argument to mathematical function
ℹ Input `oob_rmse` is `map_dbl(fit, ~sqrt(.x$prediction.error))`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)

Data frame - FID

  structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", 
"November", "December"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")

It would be helpful if the code was reproducible. For instance, you did not mention the needed packages (`library(...)`). In addition, in `recipe()` you mention a variable, `Frequency_Blue`, which is not part of your data set. — Sebastian Sauer, Nov 21 '20 at 11:45

StupidWolf · Accepted Answer · 2020-11-22T12:54:23.043

2

It was not about to extract the errors correctly, if you look at the tibble containing all the models:

fit_rf$.extracts
[[1]]
# A tibble: 1 x 1
  .extracts
  <list>   
1 <ranger>

It's embedded within a list or list, but there's no names:

names(fit_rf$.extracts[[1]][[1]])
NULL

Hence this part will fail:

map(fit_rf$.extracts,~.x$fit$fit$fit)

If you look at the structure after the first unnest, this is already the fit:

fit_rf %>% dplyr::select(.extracts) %>% unnest(cols = c(.extracts)) 
# A tibble: 3 x 1
  .extracts
  <list>   
1 <ranger> 
2 <ranger> 
3 <ranger>

So we can do:

rf_roots <- function(x){
                       x %>% 
                       select(.extracts) %>% 
                       unnest(cols = c(.extracts)) %>% 
                       mutate(oob_rmse = map_dbl(.extracts,
                                  ~sqrt(.x$prediction.error)),
                              roots = map(.extracts, 
                                  ~rf_tree_roots(.))
                               ) %>% 
                        dplyr::select(roots) %>% 
                        unnest(cols = c(roots))
                        }

This will work now:

rf_roots(fit_rf)
# A tibble: 3,000 x 1
   roots          
   <chr>          
 1 Month_August   
 2 Year           
 3 Month_July     
 4 Month_September
 5 Month_December 
 6 Month_March    
 7 Month_July     
 8 Month_September
 9 Month_December 
10 Days

Add-on: If the objective is to get the root variable for each tree in each model, one can simply do:

root_vars = unnest(fit_rf,.extracts) %>% 
pull(.extracts) %>% 
map(rf_tree_roots)

Or in base R:

lapply(fit_rf$.extracts,function(i)rf_tree_roots(i[[1]][[1]]))

And you can easily unlist this to make a barplot.

edited Nov 22 '20 at 12:54

answered Nov 21 '20 at 12:09

StupidWolf

45,075
17
40
72

Thank you StupidWolf! I was happy to receive your help, it was deeply appreciated. I wanted to undergo a process of model comparison by pinpointing the model with the best fit. In this tutorial (link above) the author is using rsme values to identify the best fit model with the lowest rsme value. I am still learning – Alice Hobbs Nov 22 '20 at 02:49
Out of interest, what method would you use? – Alice Hobbs Nov 22 '20 at 02:49
Hi @AliceHobbs, sorry i should have explained further, in ```rf_roots``` both rmse and roots are extracted, but only roots are returned, making the former redundant – StupidWolf Nov 22 '20 at 11:05
After reading the post, the rmse is used at https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/#evaluate-metrics but has nothing to do with this function – StupidWolf Nov 22 '20 at 11:09
I just think more thought could have gone into the code, or how the data is stored. A lot of the map, select, unnest could be avoided. Most likely some of the parts will throw a bug in the future, with dplyr and tidyr still evolving.. Make sense to rewrite some of the code once one is familiar with what it's doing – StupidWolf Nov 22 '20 at 11:30
OK! The function could have been broken down into smaller simpler steps but still produce the same result. I will read about the function map(). Can you suggest some simpler code or recommendations? This is a good learning curve for me. Thank you for your advice.... – Alice Hobbs Nov 22 '20 at 12:26

Tidymodels Package: Visualising a random forest model using ggplot() to show the most important predictors

1 Answers1