0

tidymodelRs out there!

The Goal: reuse a recipe object for multiple modeling types (logistic, RF, etc).

The Data: survey data that I have condensed to outcome_yes (numeric count of when something happened), outcome_no (numeric count of when that something didn't happen), total_tested (numeric count of times we wanted to see what would happen -- sum of yes and no outcome variables), cat_pred (yes/no categorical predictor), and num_pred (count potential outcome barriers).

What works:

  1. Using glm() with cbind():
example_data |>
  glm(
    cbind(outcome_yes, outcome_no) ~ cat_pred + num_pred,
    family = binomial(),
    data = _
  )
  1. Using events/trials syntax:
example_data |>
  glm(
    outcome_yes / total_tested ~ cat_pred + num_pred,
    family = binomial(),
    weights = total_tested,
    data = _
  )

The 2 above methods provide the same results and have also been replicated in SAS.

The ISSUE: Using either of the methods expressed above within a recipe() yields the error...

Error in `inline_check()`:
! No in-line functions should be used here; use steps to define baking actions.
Backtrace:
 1. recipes::recipe(...)
 2. recipes:::recipe.formula(...)
 3. recipes:::form2args(formula, data, ...)
 4. recipes:::inline_check(formula)

The next attempt involved using this tidymodels multivariate analysis example where the dependent variable was changed to outcome_yes + outcome_no. Successful until the fit() step shown below:

the_recipe <-
  recipe(
    outcome_yes + outcome_no ~ cat_pred + num_pred,
    family = binomial(),
    data = example_data
  ) |> 
  step_relevel(all_factor_predictors(), ref_level = 'No') |> 
  step_dummy(all_factor_predictors())

the_model <-
  logistic_reg() |> 
  set_engine('glm') |> 
  set_mode('classification')

the_workflow <-
  workflow() |> 
  add_recipe(the_recipe) |> 
  add_model(the_model)

the_workflow |> 
  fit(example_data)

The fit() also didn't like me:

Error in `check_outcome()`:
! For a classification model, the outcome should be a `factor`, not a `tbl_df`.
Backtrace:
  1. generics::fit(the_workflow, example_data)
  2. workflows:::fit.workflow(the_workflow, example_data)
  3. workflows::.fit_model(workflow, control)
  5. workflows:::fit.action_model(...)
  6. workflows:::fit_from_xy(spec, mold, case_weights, control_parsnip)
  8. parsnip::fit_xy.model_spec(...)
  9. parsnip:::xy_form(...)
 10. parsnip:::check_outcome(env$y, object)

Any help to get me past this would be a tremendous! Again, the goal is to create a recipe that can be used within the workflow and then incorporated with multiple modeling types. Thank you for reading this far and I appreciate your time.

KG

kyle-G
  • 51
  • 5
  • In addition to @hannahfrick's suggestion below, you may be interested in outlining how you would like to use this [in this GH issue](https://github.com/tidymodels/parsnip/issues/266). – Julia Silge Jun 29 '23 at 20:11

1 Answers1

1

A formula like cbind(count_yes, count_no) ~ . won't work with a parsnip model because tidymodels expects the outcome for a classification problem to be a factor. You can, however, use case weights. I'd recommend reading through https://www.tidyverse.org/blog/2022/05/case-weights/ for more context on case weights in the tidymodels framework.

library(tidymodels)

set.seed(403)
example_data <- tibble(
  outcome_yes = rpois(10, 4),
  outcome_no = rpois(10, 6),
  total_tested = outcome_yes + outcome_no,
  cat_pred = sample(c("yes", "no"), size = 10, replace = TRUE) %>% factor(),
  num_pred = rnorm(10)
)

# we are trying to do this via tidymodels
example_data |>
  glm(
    cbind(outcome_yes, outcome_no) ~ cat_pred + num_pred,
    family = binomial(),
    data = _
  )
#> 
#> Call:  glm(formula = cbind(outcome_yes, outcome_no) ~ cat_pred + num_pred, 
#>     family = binomial(), data = example_data)
#> 
#> Coefficients:
#> (Intercept)  cat_predyes     num_pred  
#>     -0.7720       0.3364       0.2043  
#> 
#> Degrees of Freedom: 9 Total (i.e. Null);  7 Residual
#> Null Deviance:       5.107 
#> Residual Deviance: 3.879     AIC: 35.94

the_model <-
  logistic_reg() |> 
  set_engine('glm') |> 
  set_mode('classification')

# tidymodels expects the outcome for classification models to be a factor
fit(the_model, 
    cbind(outcome_yes, outcome_no) ~ cat_pred + num_pred, 
    data = example_data)
#> Error in `check_outcome()`:
#> ! For a classification model, the outcome should be a `factor`, not a `matrix`.

# instead, use frequency weights
example_data_long <- example_data |>
  select(-total_tested) |>
  pivot_longer(cols = starts_with("outcome"), names_to = "outcome", values_to = "n") |>
  mutate(
    n = frequency_weights(n),
    outcome = stringr::str_remove(outcome, "outcome_") %>% factor(levels = c("yes", "no"))
  )
  
# direct fit via parsnip, no workflow
fit(the_model, 
    outcome ~ cat_pred + num_pred, 
    data = example_data_long,
    case_weights = example_data_long$n)
#> parsnip model object
#> 
#> 
#> Call:  stats::glm(formula = outcome ~ cat_pred + num_pred, family = stats::binomial, 
#>     data = data, weights = weights)
#> 
#> Coefficients:
#> (Intercept)  cat_predyes     num_pred  
#>      0.7720      -0.3364      -0.2043  
#> 
#> Degrees of Freedom: 19 Total (i.e. Null);  17 Residual
#> Null Deviance:       125.7 
#> Residual Deviance: 124.4     AIC: 130.4

# in a workflow with a recipe
the_recipe <- recipe(
    outcome ~ ., # recipes recognizes the case weights column and sets the appropriate role
    data = example_data_long
  ) |> 
  step_relevel(all_factor_predictors(), ref_level = "no") 

the_workflow <-
  workflow() |> 
  add_recipe(the_recipe) |> 
  add_model(the_model) |>
  add_case_weights(n) # add case weights to workflow

the_workflow |> 
  fit(example_data_long)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_relevel()
#> 
#> ── Case Weights ────────────────────────────────────────────────────────────────
#> n
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data, 
#>     weights = weights)
#> 
#> Coefficients:
#> (Intercept)  cat_predyes     num_pred  
#>      0.7720      -0.3364      -0.2043  
#> 
#> Degrees of Freedom: 19 Total (i.e. Null);  17 Residual
#> Null Deviance:       125.7 
#> Residual Deviance: 124.4     AIC: 130.4

Created on 2023-06-28 with reprex v2.0.2