Julia bet me to it so she gets the karma. I had the same answer (I do what she does but slower):
This is kind of a bug and we've been working to the best way to make it not error. It's complicated. Let me explain.
ranger
is one of a few R packages whose formula method does not create dummy variables (sensibly since it does not need them).
The infrastructure in tune
uses the workflows
package to process the formula then hand the resulting data over to ranger
. By default, workflows does create dummy variables and, since some of your factor levels are not valid R column names (e.g. "general long term"
), ranger()
kicks an error.
(I know that you didn't use a workflow, but that is what happens under the hood).
We are working on the best thing to do here since most users don't know that many tree-based model packages do not produce dummy variables. To make it more complex, parsnip
doesn't use workflows (yet) and did not have given you an error.
Solution for now
Use a simple recipe instead of a formula:
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.4 ✓ recipes 0.1.10
#> ✓ dials 0.0.4 ✓ rsample 0.0.5
#> ✓ dplyr 0.8.5 ✓ tibble 2.1.3
#> ✓ ggplot2 3.3.0 ✓ tune 0.0.1
#> ✓ infer 0.5.1 ✓ workflows 0.1.0
#> ✓ parsnip 0.0.5 ✓ yardstick 0.0.5
#> ✓ purrr 0.3.3
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341,
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675,
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L,
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA",
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"),
c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"),
d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L,
3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"),
e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L,
2L, 3L, 3L, 3L), .Label = c("cancer", "general long term",
"psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
library(tidymodels)
library(ranger)
rf_spec <- rand_forest(mode = 'regression') %>%
set_engine('ranger')
rf_folds <- vfold_cv(df, strata = c)
fit_resamples(recipe(a ~ ., data = df), rf_spec, rf_folds)
#> # 10-fold cross-validation using stratification
#> # A tibble: 9 x 4
#> splits id .metrics .notes
#> * <list> <chr> <list> <list>
#> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
#> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
#> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>
# FYI `tune` 0.0.2 will require a different argument order:
# rf_spec %>% fit_resamples(recipe(a ~ ., data = df), rf_folds)
Created on 2020-03-20 by the reprex package (v0.3.0)