Overview

I have built a random forest regression model. My aim is to fit the model using the fit_resamples() function and then tune the hyperparameters. However, I am getting the error message below:

Error Message:

   ! Fold01: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold01: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

   ! Fold02: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold02: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

   ! Fold03: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold03: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

I have searched online for a solution, but I cannot find a question that matches my particular issue. I am not an advanced R user, and I am trying my very best to slowly work my way through the Tidymodels package.

If anyone can help with this error message, I would be deeply appreciative.

Many thanks in advance

R-code

   set.seed(45L)

   #Open libraries
   library(tidymodels)
   library(ranger)
   library(dplyr)
   library(future) # provides plan() and multisession, used below

   #split this single dataset into two: a training set and a testing set
   data_split <- initial_split(FID)
   #Create data frames for the two sets:
   train_data <- training(data_split)
   test_data  <- testing(data_split)

   #resample the data with 10-fold cross-validation (10-fold by default)
   cv <- vfold_cv(train_data, v=10)

   ###########################################################
   ##Produce the recipe

   rec <- recipe(Frequency ~ ., data = FID) %>% 
     step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
     step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
     step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
     step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

   #Produce the random forest model

   mod_rf <- rand_forest(
     mtry = tune(),
     trees = 1000,
     min_n = tune()
   ) %>%
     set_mode("regression") %>%
     set_engine("ranger")

   ##Workflow
   wflow_rf <- workflow() %>% 
     add_model(mod_rf) %>% 
     add_recipe(rec)

   ##Fit model

   plan(multisession)

   fit_rf <- fit_resamples(
     wflow_rf,
     cv,
     metrics = metric_set(rmse, rsq),
     control = control_resamples(save_pred = TRUE,
                                 extract = function(x) extract_model(x)))

Data Frame FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", 
"November", "December"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")
Alice Hobbs

1 Answer

If you check the help page for fit_resamples:

fit_resamples() computes a set of performance metrics across one or more resamples. It does not perform any tuning (see tune_grid() and tune_bayes() for that)

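Your model specification still contains tune() placeholders for mtry and min_n, which fit_resamples() cannot resolve, hence the complaint that column `mtry` is not found. Most likely you need to tune first with tune_grid(), and then run fit_resamples() using the parameters obtained from the tuning, for example: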

rf_grid <- expand.grid(mtry = 2:4, min_n = c(10,15,20))

mod_rf <- rand_forest(
                      mtry = tune(),
                      trees = 1000,
                      min_n = tune()
                      ) %>%
                      set_mode("regression") %>%
                      set_engine("ranger")  

wflow_rf <- workflow() %>% 
            add_model(mod_rf) %>% 
            add_recipe(rec)

rf_res <- 
  wflow_rf %>% 
  tune_grid(
    resamples = cv,
    grid = rf_grid
    )

Find the best parameter:

show_best(rf_res,metric="rmse")
# A tibble: 5 x 7
   mtry min_n .metric .estimator  mean     n std_err
  <int> <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1     4    10 rmse    standard    7.87    10   0.743
2     4    15 rmse    standard    8.27    10   0.649
3     3    10 rmse    standard    8.49    10   0.682
4     3    15 rmse    standard    8.97    10   0.620
5     4    20 rmse    standard    9.49    10   0.605

Then run fit_resamples() again with the best values fixed (mtry = 4, min_n = 10):

mod_rf <- rand_forest(mtry = 4,trees = 1000,min_n = 10) %>%
          set_mode("regression") %>%
          set_engine("ranger")  

wflow_rf <- workflow() %>% 
            add_model(mod_rf) %>% 
            add_recipe(rec)

fit_rf<-fit_resamples(
                    wflow_rf,
                    cv,
                    metrics = metric_set(rmse, rsq),
                    control = control_resamples(save_pred = TRUE,
                    extract = function(x) extract_model(x)))
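
Instead of retyping the best values by hand, the tuning result can also be fed back into the workflow. The following is only a minimal sketch of that idea using select_best() and finalize_workflow(). The data frame `toy` and every object name ending in `_toy`, `_tune` or `_final` are hypothetical stand-ins so the example runs on its own; they are not the FID data from the question.

```r
library(tidymodels)

set.seed(45L)

# Hypothetical stand-in data (NOT the FID data): numeric outcome y, three predictors
toy <- tibble(
  x1 = rnorm(60),
  x2 = rnorm(60),
  x3 = rnorm(60),
  y  = 2 * x1 - x2 + rnorm(60)
)

cv_toy  <- vfold_cv(toy, v = 5)
rec_toy <- recipe(y ~ ., data = toy)

# Model specification with placeholders, as in the answer above
mod_tune <- rand_forest(mtry = tune(), trees = 200, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

wflow_tune <- workflow() %>%
  add_model(mod_tune) %>%
  add_recipe(rec_toy)

# Small illustrative grid; the values are arbitrary
grid_toy <- expand.grid(mtry = 1:3, min_n = c(5, 10))

res_toy <- tune_grid(wflow_tune, resamples = cv_toy, grid = grid_toy)

# One-row tibble holding the parameter combination with the lowest mean rmse
best <- select_best(res_toy, metric = "rmse")

# Replace the tune() placeholders in the workflow with those values
wflow_final <- finalize_workflow(wflow_tune, best)

# The finalized workflow has no placeholders left, so fit_resamples() can run
fit_final <- fit_resamples(
  wflow_final,
  cv_toy,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(fit_final)
```

With your own objects, the same pattern would presumably be select_best(rf_res, metric = "rmse") followed by finalize_workflow(wflow_rf, ...), but I have not run that against your data.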
StupidWolf
  • Hey StupidWolf. Thank you so much for your help. I have been trying to solve my issues for days and ended up confusing myself. At a further point in my code, I have created a figure with tree_depth on the x-axis and the mean rmse and rsq values on the y-axis (produces two plots in one figure). Is there any way that tree_depth() can be incorporated back into the random forest model? Sorry to ask more questions, I hope you don't think I'm overstepping, but I have produced this plot for my other three models. – Alice Hobbs Dec 19 '20 at 05:21
  • I am very very grateful for your help. My previous incorrect code extracted tree_depth when I was using the function collect_metrics(). – Alice Hobbs Dec 19 '20 at 05:23
  • Hi @AliceHobbs, no problem. So do you want to tune the tree depth (which is max.depth in ranger), or do you want to obtain the depth of individual trees from the final model? – StupidWolf Dec 19 '20 at 06:04
  • It's probably the depth of individual trees, because the plot shows all trees with their mean rmse and rsq when I use collect_metrics() on the tuning results from tune_grid() – Alice Hobbs Dec 19 '20 at 06:08
  • Hmmm. Trying to clarify things. I guess you are comparing with other tree based models, for example rpart or gbm, and these models have tree_depth() as a tuning parameter. Your question is whether ranger has the option to tune this.. – StupidWolf Dec 19 '20 at 06:13
  • I think you cannot tune tree_depth() – StupidWolf Dec 19 '20 at 06:23
  • I was just looking and I don't think so either. Do you think I will need to set trees to trees = tune()? – Alice Hobbs Dec 19 '20 at 06:31
  • I was just reading and I was thinking about the argument: importance = "permutation" or importance = "impurity" in the set_engine() function. My last incorrect model contained: importance = "permutation" – Alice Hobbs Dec 19 '20 at 06:36
  • This is the version of the incorrect random forest model: mod_rf <-rand_forest(trees = 1e3) %>% set_engine("ranger", num.threads = parallel::detectCores(), importance = "permutation", verbose = TRUE) %>% set_mode("regression") – Alice Hobbs Dec 19 '20 at 06:39
  • The trees = option decides how many decision trees to build in the random forest model. Yes, you can tune that, but it has nothing to do with tree_depth() as you see with other models – StupidWolf Dec 19 '20 at 06:57
  • Ok! I understand! I don’t understand how I managed to incorporate tree_depth with the other code – Alice Hobbs Dec 19 '20 at 08:56
  • https://stackoverflow.com/questions/44291685/what-is-equivalent-of-max-depth-in-the-r-package-ranger This is an interesting read – Alice Hobbs Dec 19 '20 at 09:00
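
On the tree depth question raised in the comments: rand_forest() has no tree_depth argument, but ranger itself has a max.depth argument, and engine-specific arguments can be passed through set_engine(). The following is a minimal sketch with a fixed depth of 3 (an arbitrary value chosen for illustration). The `toy` data and the names `mod_depth`, `fit_depth` and `rf_obj` are hypothetical, and this does not show tuning the depth, only setting it.

```r
library(tidymodels)

set.seed(45L)

# Hypothetical stand-in data (NOT the FID data)
toy <- tibble(
  x1 = rnorm(60),
  x2 = rnorm(60),
  y  = 2 * x1 - x2 + rnorm(60)
)

# max.depth is a ranger argument, so it goes into set_engine(),
# not into rand_forest()
mod_depth <- rand_forest(trees = 200) %>%
  set_mode("regression") %>%
  set_engine("ranger", max.depth = 3)

fit_depth <- fit(mod_depth, y ~ ., data = toy)

# Pull out the underlying ranger object to inspect it
rf_obj <- extract_fit_engine(fit_depth)
rf_obj$num.trees
```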