Regression trees with tidymodels

Question

When using regression trees, how do you decide whether to use tune_grid() or fit_resamples()?

I tried these two things:

1. Using tune_grid()

tune_spec <- decision_tree(min_n = tune(), tree_depth = tune(), cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")
tree_grid <- tune_spec %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 3)
set.seed(275)
folds <- vfold_cv(train_set, v = 3)
tune_results <- tune_grid(tune_spec, outcome ~ ., resamples = folds, grid = tree_grid, metrics = metric_set(rmse))

That resulted in the following error:

factor has new levels... there were issues with some computations

2. Using fit_resamples()

tune_results <- fit_resamples(tune_spec, outcome ~ ., resamples = folds, grid = tree_grid, metrics = metric_set(rmse))

That resulted in this error:

! 3 arguments have been tagged for tuning in these components: model_spec. 
Please use one of the tuning functions (e.g. `tune_grid()`) to optimize them.

Before I try to figure out what's going wrong, I'd like to know which one I'm supposed to be using in the first place.

You should use `tune_grid` or a tuning function from the `finetune` package if there are parameters that you can't estimate from the data (i.e., hyperparameters). `tune_grid` does what `fit_resamples` does, except that a range of candidate models is fit to your resamples. If you want to use `fit_resamples`, you need to replace every `tune()` placeholder in your model specification or preprocessing recipe with a value. If you want to troubleshoot the error in the first code block, it would help to share a reproducible example. – Seth, Jun 18 '23 at 01:43
Additionally, the new factor level error means that when you created your cv folds, there were some that ended up with a strict subset of some of the levels of a categorical variable. It's important that the data in each fold be representative of the full data, so you may need to generate your cv folds with some form of stratified sampling so that each subset is "balanced" in terms of factor levels. — joran, Jun 18 '23 at 02:59
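The stratified sampling suggested in the comment above can be sketched with the `strata` argument of `vfold_cv()`. This is only an illustration under assumptions: `iris` stands in for `train_set`, and `Species` stands in for the categorical predictor whose levels should be present in every fold. `strata` takes a single variable, and stratification may not help with very rare levels.

```r
library(tidymodels)

# Assumption: iris is a stand-in for train_set; Species plays the role of
# the categorical predictor that caused the "new levels" problem.
set.seed(275)
strat_folds <- vfold_cv(iris, v = 3, strata = Species)

# Count how many Species levels are present in each analysis set.
levels_per_fold <- vapply(
  strat_folds$splits,
  function(s) nlevels(droplevels(analysis(s)$Species)),
  integer(1)
)
levels_per_fold
```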

score 0 · Answer 1 · answered Jun 19 '23 at 18:25

You should use fit_resamples() if none of your arguments are marked with tune(). Otherwise you should use tune_grid() or one of the finetune variants.
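For contrast, here is a minimal sketch of the fit_resamples() case, where every hyperparameter has a fixed value. The data and the values are assumptions chosen for illustration: `mtcars` stands in for `train_set`, `mpg` for `outcome`, and the three hyperparameter values are arbitrary.

```r
library(tidymodels)

# Assumption: arbitrary fixed values; there are no tune() placeholders here.
fixed_spec <- decision_tree(min_n = 5, tree_depth = 4, cost_complexity = 0.01) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Assumption: mtcars stands in for train_set, mpg for outcome.
set.seed(275)
folds <- vfold_cv(mtcars, v = 3)

# One model specification is fit to each resample and scored on the
# held-out assessment set.
resample_results <- fit_resamples(
  fixed_spec,
  mpg ~ .,
  resamples = folds,
  metrics = metric_set(rmse)
)

collect_metrics(resample_results)
```

Because nothing is being tuned, there is no grid to supply; the result is a single resampled estimate of RMSE for that one specification.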

So in your situation, since you have used tune(), you want tune_grid(), which is what you did. You are getting the error "factor has new levels... there were issues with some computations" because some of your predictors are categorical. When the model is fit inside tune_grid(), it is first trained on the analysis set of each resample and then predicts on the corresponding assessment set. One or more of the categorical variables had levels that appear only in the assessment set, so the fitted model has never seen them.

One way to deal with this is to use recipes to do preprocessing. The step step_novel() was created to deal with this exact problem.
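To see what step_novel() does in isolation, here is a small sketch on made-up data. The data frames `train_toy` and `new_toy` are hypothetical; the level "c" is deliberately absent from the training rows.

```r
library(tidymodels)

# Hypothetical toy data: level "c" never appears in the training rows.
train_toy <- data.frame(y = c(1, 2, 3, 4), x = factor(c("a", "b", "a", "b")))
new_toy <- data.frame(y = c(5, 6), x = factor(c("a", "c")))

# prep() records the levels seen in the training data.
rec <- recipe(y ~ x, data = train_toy) %>%
  step_novel(all_nominal_predictors()) %>%
  prep()

# bake() maps any level that was not seen during prep() to the
# step's new_level, which defaults to "new".
baked <- bake(rec, new_data = new_toy)
levels(baked$x)
baked$x
```

Inside tune_grid(), the recipe is prepped on each analysis set and applied to the matching assessment set, so unseen levels are handled the same way there.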

Then your code would look like this, where I used a workflow() to combine the recipe and the model specification.

library(tidymodels)

# step_novel() assigns factor levels unseen during training to a new level
rec_spec <- recipe(outcome ~ ., data = train_set) %>%
  step_novel(all_nominal_predictors())

tune_spec <- decision_tree(
    min_n = tune(), tree_depth = tune(), cost_complexity = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

wf_spec <- workflow(rec_spec, tune_spec)

tree_grid <- wf_spec %>% 
  extract_parameter_set_dials() %>% 
  grid_regular(levels = 3)

set.seed(275)
folds <- vfold_cv(train_set, v = 3)

tune_results <- tune_grid(
  wf_spec, 
  resamples = folds, 
  grid = tree_grid, 
  metrics = metric_set(rmse)
)
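Once tuning finishes, a common next step is to pick the best candidate and fit it to the full training set. The sketch below repeats the pipeline on stand-in data so that it runs on its own. These parts are assumptions: `mtcars` replaces `train_set`, `mpg` replaces `outcome`, and the grid uses `levels = 2` only to keep the run short.

```r
library(tidymodels)

# Assumption: mtcars stands in for train_set, mpg for outcome.
rec_spec <- recipe(mpg ~ ., data = mtcars) %>%
  step_novel(all_nominal_predictors())

tune_spec <- decision_tree(
    min_n = tune(), tree_depth = tune(), cost_complexity = tune()
  ) %>%
  set_engine("rpart") %>%
  set_mode("regression")

wf_spec <- workflow(rec_spec, tune_spec)

# Assumption: a smaller grid (2 levels per parameter) for a quick run.
tree_grid <- wf_spec %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = 2)

set.seed(275)
folds <- vfold_cv(mtcars, v = 3)

tune_results <- tune_grid(
  wf_spec,
  resamples = folds,
  grid = tree_grid,
  metrics = metric_set(rmse)
)

# Pick the candidate with the lowest resampled RMSE, plug its values into
# the workflow, and fit that workflow to the full training data.
best_params <- select_best(tune_results, metric = "rmse")
final_wf <- finalize_workflow(wf_spec, best_params)
final_fit <- fit(final_wf, data = mtcars)

preds <- predict(final_fit, new_data = head(mtcars))
preds
```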
