fit_resamples with ranger package fails

Question

I am trying to use cross-validation resampling to fit a random forest from the ranger package. The fit without resampling works, but as soon as I try a resampled fit it fails with the error below.

Consider the following df:

df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341, 
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675, 
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L, 
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA", 
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"), 
    c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
    2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"), 
    d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 
    3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"), 
    e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L, 
    2L, 3L, 3L, 3L), .Label = c("cancer", "general long term", 
    "psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))

The following simple fit works without issues:

library(tidymodels)
library(ranger)

rf_spec <- rand_forest(mode = 'regression') %>% 
  set_engine('ranger')


rf_spec %>% 
  fit(a ~. , data = df)

But as soon as I try to run the cross-validation via

rf_folds <- vfold_cv(df, strata = c)

fit_resamples(a ~ . ,
              rf_spec,
              rf_folds)

I get the following error:

model: Error in parse.formula(formula, data, env = parent.frame()): Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.

Seems to be an issue with values with spaces within a column being turned into a dummy variable [see here](https://github.com/tidymodels/tune/issues/151) — CER, Mar 10 '20 at 03:49

score 5 · Accepted Answer · answered Mar 20 '20 at 21:19

The commenter above is correct that the source of the issue here is the spaces in the factor column. The functions for resampling and the functions for just plain old fitting currently handle that differently, and we are actively looking into how to solve this problem for users. Thank you for your patience!

In the meantime, I would recommend setting up a simple workflow() plus a recipe(), which together will handle all the necessary dummy variable munging for you.

library(tidymodels)

rf_spec <- rand_forest(mode = "regression") %>% 
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(recipe(a ~ ., data = df))


fit(rf_wf, data = df)
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  500 
#> Sample size:                      15 
#> Number of independent variables:  4 
#> Mtry:                             2 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       4.7042e+17 
#> R squared (OOB):                  0.4341146

rf_folds <- vfold_cv(df, strata = c)

fit_resamples(rf_wf,
              rf_folds)
#> #  10-fold cross-validation using stratification 
#> # A tibble: 9 x 4
#>   splits         id    .metrics         .notes          
#>   <list>         <chr> <list>           <list>          
#> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
#> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
#> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>

Created on 2020-03-20 by the reprex package (v0.3.0)

score 3 · Answer 2 · answered Mar 20 '20 at 21:55

Julia beat me to it, so she gets the karma. I had the same answer (I do what she does, but slower):

This is kind of a bug, and we've been working on the best way to make it not error. It's complicated. Let me explain.

ranger is one of a few R packages whose formula method does not create dummy variables (sensibly since it does not need them).

The infrastructure in tune uses the workflows package to process the formula then hand the resulting data over to ranger. By default, workflows does create dummy variables and, since some of your factor levels are not valid R column names (e.g. "general long term"), ranger() kicks an error.

(I know that you didn't use a workflow, but that is what happens under the hood).
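
To make the naming problem concrete, here is a minimal base R sketch. It uses model.matrix() as a stand-in for the dummy-variable expansion; that is an assumption for illustration only and not necessarily the exact code path workflows takes. The factor has the same levels as column e in the question, and dummy_names and invalid_names are names made up for this example.

```r
# Factor with the same levels as column `e` in the question.
e <- factor(c("cancer", "general long term", "psychiatric", "rehabilitation"))

# Stand-in for the dummy-variable expansion (illustration only, not the
# workflows internals): one indicator column per non-reference level.
dummies <- model.matrix(~ e, data = data.frame(e = e))
dummy_names <- setdiff(colnames(dummies), "(Intercept)")

# A name is syntactically valid if make.names() leaves it unchanged.
is_valid <- dummy_names == make.names(dummy_names)
invalid_names <- dummy_names[!is_valid]

# Show the generated column name(s) that are not valid R names.
print(invalid_names)
```

Only the column generated from "general long term" fails the check, because it contains spaces. Names like this are what ranger's formula interface rejects as illegal column names.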

We are working on the best thing to do here, since most users don't know that many tree-based model packages do not produce dummy variables. To make it more complex, parsnip doesn't use workflows (yet), so your plain fit() call did not give you an error.

Solution for now

Use a simple recipe instead of a formula:

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.4      ✓ recipes   0.1.10
#> ✓ dials     0.0.4      ✓ rsample   0.0.5 
#> ✓ dplyr     0.8.5      ✓ tibble    2.1.3 
#> ✓ ggplot2   3.3.0      ✓ tune      0.0.1 
#> ✓ infer     0.5.1      ✓ workflows 0.1.0 
#> ✓ parsnip   0.0.5      ✓ yardstick 0.0.5 
#> ✓ purrr     0.3.3
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()
df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341, 
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675, 
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L, 
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA", 
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"), 
    c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
    2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"), 
    d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 
    3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"), 
    e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L, 
    2L, 3L, 3L, 3L), .Label = c("cancer", "general long term", 
    "psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))

library(ranger)

rf_spec <- rand_forest(mode = 'regression') %>% 
  set_engine('ranger')

rf_folds <- vfold_cv(df, strata = c)
fit_resamples(recipe(a ~ ., data = df),  rf_spec, rf_folds)
#> #  10-fold cross-validation using stratification 
#> # A tibble: 9 x 4
#>   splits         id    .metrics         .notes          
#> * <list>         <chr> <list>           <list>          
#> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
#> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
#> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>

# FYI `tune` 0.0.2 will require a different argument order: 
# rf_spec %>% fit_resamples(recipe(a ~ ., data = df), rf_folds)

Created on 2020-03-20 by the reprex package (v0.3.0)
