Why does an "id variable" in tidymodels/recipes play a predictor role?

Question

This is the same issue as "Predict with step_naomit and retain ID using tidymodels". That question has an accepted answer, but the OP's last comment there points out that the "id variable" is still being used as a predictor, as can be seen by looking at model$fit$variable.importance.

I have a dataset with "id variables" I would like to keep. I thought I would be able to achieve this with a recipe() specification.

library(tidymodels)

# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  update_role(y, new_role = "outcome") %>% 
  update_role(x, new_role = "predictor") %>% 
  update_role(f, new_role = "predictor") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())

train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept)       label           x         f_b         f_c 
#>  1.03664140 -0.01405316  0.22357266 -1.80701531 -1.66285399

Created on 2020-01-27 by the reprex package (v0.3.0)

Even though I specified that label is an id variable, it is being used as a predictor. So maybe I can list only the terms I want in the formula and then add label explicitly as an id variable.

rec <- recipe(training(df_split), y ~ x + f) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found

Created on 2020-01-27 by the reprex package (v0.3.0)

I can try not mentioning label at all:

rec <- recipe(training(df_split), y ~ x + f) %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())


train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# label is no longer a variable in the model
logit_fit[['fit']][['coefficients']]
#> (Intercept)           x         f_b         f_c 
#> -0.98950228  0.03734093  0.98945339  1.27014824

train_juiced
#> # A tibble: 35 x 4
#>          x y       f_b   f_c
#>      <dbl> <fct> <dbl> <dbl>
#>  1 -0.928  Y         1     0
#>  2  4.54   N         0     0
#>  3 -1.14   N         1     0
#>  4 -5.19   N         1     0
#>  5 -4.79   N         0     0
#>  6 -6.00   N         0     0
#>  7  3.83   N         0     1
#>  8 -8.66   Y         1     0
#>  9 -0.0849 Y         1     0
#> 10 -3.57   Y         0     1
#> # ... with 25 more rows

Created on 2020-01-27 by the reprex package (v0.3.0)

OK, so the model works, but I have lost my label.
How can I keep label in the data without it being used as a predictor?

score 11 · Accepted Answer · answered Feb 16 '20 at 01:10

The main conceptual problem you are running into is that once you juice() the recipe, the result is just data: an ordinary data frame. When you use that data frame to fit a model, the model has no way of knowing that some of the variables had special roles.

library(tidymodels)

# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

rec <- recipe(y ~ ., training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes()) %>%
  prep()

train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#>    label     x y       f_b   f_c
#>    <int> <dbl> <fct> <dbl> <dbl>
#>  1     1  1.80 N         1     0
#>  2     3  1.45 N         0     0
#>  3     5 -5.00 N         0     0
#>  4     6 -4.15 N         1     0
#>  5     7  1.37 Y         0     1
#>  6     8  1.62 Y         0     1
#>  7    10 -1.77 Y         1     0
#>  8    11 -3.15 N         0     1
#>  9    12 -2.02 Y         0     1
#> 10    13  2.65 Y         0     1
#> # … with 25 more rows

Notice that train_juiced is just a regular tibble. If you train a model on this tibble using fit(), the model won't know anything about the recipe that was used to transform the data.
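To make that point concrete, here is a minimal sketch (not part of the original reprex) that compares where the role information lives. The seed and the object names roles, juiced and formula_vars are assumptions chosen for illustration. summary() on a recipe reports each variable's role, whereas the juiced tibble is plain data, so a formula such as y ~ . expands to every remaining column, label included.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible

df <- tibble(label = 1:50,
             x = rnorm(50, 0, 5),
             f = factor(sample(c("a", "b", "c"), 50, replace = TRUE)),
             y = factor(sample(c("Y", "N"), 50, replace = TRUE)))

rec <- recipe(y ~ ., data = df) %>%
  update_role(label, new_role = "id variable") %>%
  prep()

# The role metadata is stored on the recipe object
roles <- summary(rec)
roles[, c("variable", "role")]

# The juiced data is an ordinary tibble; its columns carry no roles
juiced <- juice(rec)

# Expanding the dot against the juiced data pulls in every non-outcome column
formula_vars <- all.vars(terms(y ~ ., data = juiced))
formula_vars
```

This is why fit(y ~ ., data = train_juiced) ends up with a coefficient for label: the formula only ever sees column names, never the recipe.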

The tidymodels framework does have a way to train models using the role information from the recipe. Probably the easiest way to do that is using workflows.

logit_spec <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") 

wf <- workflow() %>%
  add_model(logit_spec) %>%
  add_recipe(rec)

logit_fit <- fit(wf, training(df_split))

# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#> 
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
#> 
#> Coefficients:
#> (Intercept)            x          f_b          f_c  
#>     0.42331     -0.04234     -0.04991      0.64728  
#> 
#> Degrees of Freedom: 34 Total (i.e. Null);  31 Residual
#> Null Deviance:       45 
#> Residual Deviance: 44.41     AIC: 52.41

Created on 2020-02-15 by the reprex package (v0.3.0)

No more labels in the model!
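For readers on newer package versions, the following is a hedged sketch of the same idea that also carries label through to the predictions. It rests on a few assumptions that are not in the original answer: the recipe is added to the workflow without calling prep() first (to my knowledge, recent workflows releases refuse a recipe that is already trained), extract_fit_engine() is available (it was added to workflows after this answer was written), and the seed and object names are made up for illustration. The imputation step is left out because the toy data has no missing values; in current recipes releases step_meanimpute() has been renamed step_impute_mean().

```r
library(tidymodels)

set.seed(2020)  # arbitrary seed, only so the sketch is reproducible

df <- tibble(label = 1:50,
             x = rnorm(50, 0, 5),
             f = factor(sample(c("a", "b", "c"), 50, replace = TRUE)),
             y = factor(sample(c("Y", "N"), 50, replace = TRUE)))

df_split <- initial_split(df, prop = 0.70)

# Unprepped recipe: the workflow estimates it when fit() is called
rec <- recipe(y ~ ., data = training(df_split)) %>%
  update_role(label, new_role = "id variable") %>%
  step_dummy(all_predictors(), -all_numeric())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

wf_fit <- fit(wf, data = training(df_split))

# Coefficient names of the underlying glm; label should not be among them
coef_names <- names(coef(extract_fit_engine(wf_fit)))
coef_names

# Predict on the test set and put the identifier back next to the predictions
test_pred <- testing(df_split) %>%
  select(label) %>%
  bind_cols(predict(wf_fit, new_data = testing(df_split)))
test_pred
```

Because the workflow applies the recipe itself at predict time, the test data can be passed in untouched, and label is simply bound back on from the original rows.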
