Is there a reason the xgboost code snippet from the usemodels package has one_hot set to TRUE?

Question

Is there a reason the recipe code snippet for xgboost classifier has one_hot = TRUE? This creates "n" dummy variables instead of "n-1". I usually set it to FALSE but just want to make sure I'm not missing something.

Code -

library(dplyr)

data <- mtcars %>% 
  as_tibble() %>%  
  mutate(cyl = as.factor(cyl))

usemodels::use_xgboost(mpg ~ cyl, data = data)

Output -

xgboost_recipe <- 
  recipe(formula = mpg ~ cyl, data = data) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(28278)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

score 0 · Accepted Answer · answered Apr 07 '21 at 16:10

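The idea is that xgboost, as a tree-based model, can handle a full set of "n" indicator columns (unlike a linear model, where they would be collinear with the intercept). If you drop one level, the trees can actually need more splits to fit well, because the dropped level is only identifiable as "all other indicators are zero". Read more about this here.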
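To make the difference concrete, here is a minimal sketch that bakes the same recipe with both settings and compares the resulting columns. The helper name `encode` is made up for illustration, and the sketch assumes a version of recipes that provides `all_nominal_predictors()`.

```r
library(recipes)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # three levels: 4, 6, 8

# Hypothetical helper: prep and bake a one-step recipe with the given one_hot setting
encode <- function(one_hot) {
  recipe(mpg ~ cyl, data = cars) %>%
    step_dummy(all_nominal_predictors(), one_hot = one_hot) %>%
    prep() %>%
    bake(new_data = NULL)
}

full      <- encode(one_hot = TRUE)   # one indicator column per level
reference <- encode(one_hot = FALSE)  # first level dropped as the reference

names(full)
names(reference)
```

With `one_hot = FALSE`, the 4-cylinder cars are the rows where both remaining indicators are zero, so a tree has to split on both columns to isolate them. With `one_hot = TRUE`, a single split on the 4-cylinder indicator is enough.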

You don't see the same for the ranger random forest because it can handle factors natively.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%  
  mutate(cyl = cyl %>% as.factor)

usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
#> ranger_recipe <- 
#>   recipe(formula = mpg ~ cyl, data = cars) 
#> 
#> ranger_spec <- 
#>   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
#>   set_mode("regression") %>% 
#>   set_engine("ranger") 
#> 
#> ranger_workflow <- 
#>   workflow() %>% 
#>   add_recipe(ranger_recipe) %>% 
#>   add_model(ranger_spec) 
#> 
#> set.seed(54153)
#> ranger_tune <-
#>   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

Created on 2021-04-07 by the reprex package (v2.0.0)

Got it. Thank you so much. – The Rookie, Apr 13 '21 at 09:02
