If a categorical variable has more than two levels (like marital status = single/married/widowed/separated/divorced), then I need to create N dummies, one for each of the possible levels. This is done using step_dummy(one_hot = TRUE).

However, if the category is binary (pokemon_fan = "yes"/"no") then I only need to create a single dummy called "pokemon_fan_yes". This is done using step_dummy(one_hot = FALSE).

Is it possible for step_dummy to count the number of levels and proceed differently depending on that number?

Thanks.

Zoltan
  • According to the help page, it should do it automatically, i.e. for a binary var with yes/no, specifying one_hot = TRUE will create C-1 levels. So for a binary variable it will create one var; for a categorical var with three levels it will create 2 dummies. This also shows that for a binary var, dummy coding is not necessary, because by definition it already contains the info about yes/not-yes. – deschen Feb 23 '22 at 15:39
  • From the help page: "one_hot: A logical. For C levels, should C dummy variables be created rather than C-1?". If TRUE, then it creates C dummies, not C-1. This approach is not used for GLM, but it is useful for XGBoost. The problem is that if you set it to TRUE, it will create 2 dummies for all your TRUE/FALSE categorical variables, which is useful for neither GLM nor XGBoost. – Zoltan Feb 23 '22 at 15:51
  • Categorical variables with more than 2 values don't necessarily need one hot encoding to capture all the information (e.g., if single/married/widowed/separated == 0, then implicitly divorced == 1). To do what you want, however, I think you'd just need two calls to `step_dummy()` - one for cols with `one_hot = TRUE` and one for cols with `one_hot = FALSE` – Mark Rieke Feb 23 '22 at 16:55
  • Yes, implicitly divorced == 1, but the decision tree will have a much easier time finding out that divorcees behave differently if they have their own dummy variable, rather than having to split on married == 0, then single == 0, then widowed == 0, then separated == 0. I agree with the two calls to step_dummy; the problem is that you have to know beforehand which variables are binary and which aren't. This is something I'd rather have the step figure out by itself. I'd rather have a step_dummy_xgb(all_nominal()) function. – Zoltan Feb 23 '22 at 18:03
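
To make the C versus C-1 distinction discussed in the comments above concrete, here is a minimal sketch. The `toy` data frame and its `status` column are made up for illustration; only `step_dummy()` and its `one_hot` argument come from recipes.

```r
library(recipes)

# Hypothetical toy data: a single factor with C = 3 levels
toy <- data.frame(status = factor(c("single", "married", "divorced")))

# one_hot = TRUE keeps one indicator column per level (C columns)
full <- recipe(~ ., data = toy) %>%
  step_dummy(status, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

# one_hot = FALSE drops the first level as the reference (C - 1 columns)
reduced <- recipe(~ ., data = toy) %>%
  step_dummy(status, one_hot = FALSE) %>%
  prep() %>%
  bake(new_data = NULL)

ncol(full)     # 3
ncol(reduced)  # 2
```

With `one_hot = FALSE`, the dropped level is the first factor level, so a row with zeros in every remaining column belongs to that reference level.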

1 Answer

There is no automatic way to do this within recipes itself, but I think you can create a function that will handle this for you, something like this:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(crickets, package = "modeldata")

levels_more_than <- function(vec, num = 2) {
  n_distinct(levels(vec)) > num
}

recipe(~ ., data = crickets) %>%
  step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 31 × 3
#>     temp  rate species_O..niveus
#>    <dbl> <dbl>             <dbl>
#>  1  20.8  67.9                 0
#>  2  20.8  65.1                 0
#>  3  24    77.3                 0
#>  4  24    78.7                 0
#>  5  24    79.4                 0
#>  6  24    80.4                 0
#>  7  26.2  85.8                 0
#>  8  26.2  86.6                 0
#>  9  26.2  87.5                 0
#> 10  26.2  89.1                 0
#> # … with 21 more rows

recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 150 × 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
#>           <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
#>  1          5.1         3.5          1.4         0.2              1
#>  2          4.9         3            1.4         0.2              1
#>  3          4.7         3.2          1.3         0.2              1
#>  4          4.6         3.1          1.5         0.2              1
#>  5          5           3.6          1.4         0.2              1
#>  6          5.4         3.9          1.7         0.4              1
#>  7          4.6         3.4          1.4         0.3              1
#>  8          5           3.4          1.5         0.2              1
#>  9          4.4         2.9          1.4         0.2              1
#> 10          4.9         3.1          1.5         0.1              1
#> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
#> #   Species_virginica <dbl>

Created on 2022-02-23 by the reprex package (v2.0.1)
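
If the data contains both binary and multi-level factors, one possible extension is to count the levels up front and issue two `step_dummy()` calls, as suggested in the comments. This is only a sketch under some assumptions: the `mybinary` column and the `n_lvls`, `binary_cols`, and `multi_cols` names are made up for illustration, and only factor columns are inspected (character columns would need converting first).

```r
library(recipes)

# Hypothetical example data: iris plus an invented two-level factor
dat <- iris
set.seed(1)
dat$mybinary <- factor(sample(c("no", "yes"), nrow(dat), replace = TRUE))

# Count the levels of every factor column, then split the names by count
n_lvls <- vapply(Filter(is.factor, dat), nlevels, integer(1))
binary_cols <- names(n_lvls)[n_lvls == 2]
multi_cols  <- names(n_lvls)[n_lvls > 2]

baked <- recipe(~ ., data = dat) %>%
  # Factors with more than two levels: one column per level
  step_dummy(all_of(multi_cols), one_hot = TRUE) %>%
  # Binary factors: a single indicator column
  step_dummy(all_of(binary_cols), one_hot = FALSE) %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

Here `Species` should expand into three columns while `mybinary` becomes the single column `mybinary_yes`. The level counts are taken from the training data before the recipe is built, so they are fixed when the recipe is defined and are not recomputed by `prep()`.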

Here are some tips for using not-quite-standard selectors in recipes.

Julia Silge
  • Woah! I wasn't aware you could pass a vector to one_hot. This totally works. Thank you so much for your help and the recommended reading. For my future reference, here is code that uses all_nominal() and demonstrates what happens with a 2-level and a 3-level factor in the same data: `recipe(~ ., data = iris %>% mutate(mybinary = factor(round(runif(nrow(iris)))))) %>% step_dummy(all_nominal(), one_hot = !! levels_more_than(all_nominal())) %>% prep() %>% bake(new_data = NULL)` – Zoltan Feb 24 '22 at 09:09