
I am building a logistic regression model with an outcome variable with 2 categories: a_category / z_category, and I have the following questions:

  1. I am interested in predicting "z_category" from the independent variables, so my reference category should be "a_category". Since "a_category" is already the first level of the variable, it should not be necessary to relevel my outcome, and this could be the code:

Splits:

splits <- initial_split(df1, strata = outcome, prop = 3/4)
training_set <- training(splits)
test_set  <- testing(splits)

Recipe:

glm_rec <-
  recipe(outcome ~ ., data = training_set) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

Model spec:

glm_spec <- 
  logistic_reg() %>% 
  set_engine("glm") 

Workflow:

glm_final_wf <- 
  workflow() %>% 
  add_model(glm_spec) %>% 
  add_recipe(glm_rec)

Am I right?
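
The level-order point in question 1 can be checked directly with base R. This is only a minimal sketch on simulated data (the predictor `x` and the simulated `outcome` are hypothetical stand-ins, not the real `df1`): `stats::glm()`, which `set_engine("glm")` uses, treats the first factor level as the reference and models the probability of the second level.

```r
# Simulated data (hypothetical): larger x raises the odds of "z_category"
set.seed(1)
n <- 500
x <- rnorm(n)
outcome <- factor(
  ifelse(runif(n) < plogis(2 * x), "z_category", "a_category"),
  levels = c("a_category", "z_category")
)

# The first level is the reference; glm() models the probability of the second
levels(outcome)

fit <- glm(outcome ~ x, family = binomial)

# A positive coefficient means larger x raises the odds of "z_category"
coef(fit)
exp(coef(fit))  # odds ratios for "z_category" versus "a_category"
```

If `levels(outcome)` lists "a_category" first, the coefficients and odds ratios already describe "z_category" and no releveling is needed for the model itself.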

  2. Internal validation and ROC curves: I am using event_level = "second" to calculate the metrics and the ROC curve with yardstick functions:
# metrics
glm_internalval_res <- glm_final_wf %>% 
  fit_resamples(
    resamples = vfold_cv(training_set, 
                                  v= 10, 
                                  repeats = 2, 
                                  strata = outcome),
    control = control_resamples(save_pred = TRUE, event_level = "second"),
    metrics = metric_set(
      yardstick::roc_auc, 
      yardstick::accuracy,
      yardstick::sens, 
      yardstick::spec,
      yardstick::precision, 
      yardstick::ppv,
      yardstick::npv)
      )

# ROC curve
glm_internalval_res %>%
  collect_predictions()%>%
  group_by(id, id2) %>%
  roc_curve(truth=outcome, 
            .pred_z_category,
            event_level = "second"
            ) %>%
   autoplot()

Am I right?

  3. External validation with last_fit(): I cannot find how to set event_level = "second". This is what I tried:
glm_externalval_res <- 
  last_fit(glm_final_wf, 
           splits,
           metrics = metric_set(yardstick::roc_auc, 
      yardstick::accuracy,
      yardstick::sens, 
      yardstick::spec,
      yardstick::precision, 
      yardstick::ppv,
      yardstick::npv)
  )

With this chunk, the metrics are computed for the first category, "a_category", which I think is not correct.

I am wondering how to tell last_fit() that my category of interest is "z_category". I couldn't find an answer in the package documentation.

Thanks.

Rafael.

2 Answers


The easiest thing to do is definitely to rename your levels so the one that you are interested in is first. However, if that is not what you want to do, then you need to make a metric with an option and put it into a metric_set(). The procedure for this is outlined in the docs for metric_set().

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

roc_auc_with_event_level <- function(data, truth, ..., na_rm = TRUE) {
   roc_auc(
      data = data,
      truth = !! rlang::enquo(truth),
      ...,
      na_rm = na_rm,
      # set event level
      event_level = "second"
   )
}

roc_auc_with_event_level <- new_prob_metric(roc_auc_with_event_level, "maximize")

ms <- metric_set(accuracy, roc_auc_with_event_level)
ms
#> # A tibble: 2 × 3
#>   metric                   class        direction
#>   <chr>                    <chr>        <chr>    
#> 1 accuracy                 class_metric maximize 
#> 2 roc_auc_with_event_level prob_metric  maximize

Created on 2021-08-01 by the reprex package (v2.0.0)

Now you can use this metric set ms in tuning functions like last_fit(metrics = ms).

Julia Silge
  • Hi Julia, two comments. (1) I tried renaming the levels and also step_relevel(), setting my category of interest as the first (reference) level. This solves the metrics issue, but now the logistic regression model predicts the second category (the odds ratios refer to the second category), and I need the odds ratios to predict the first category. Is there any other way to fix this? (2) I got an error message when running your suggestion: .notes column: Internal: Error: in metric: ‘ roc_auc_with_event_level’. No valid variables provided to ‘…’. Any help? Thanks. – Rafael Santamaria Aug 03 '21 at 11:56
  • Oh, you're right; sorry about that. We just [started an issue to add a `control` argument here](https://github.com/tidymodels/tune/issues/399). I think the best thing for you to do in the meantime is to not use `last_fit()` but to manually `fit()` one time on the training data and `predict()` on the testing data, then use the wrapped metric directly as shown in the docs. – Julia Silge Aug 03 '21 at 17:42
  • OK, I see, thanks. How should I preprocess the data with a recipe before manually calling fit() and predict()? Should I prep() and bake() the training and testing data and then fit() and predict() with the baked data? I did that, but the confusion matrix I got manually is different from the one I got using last_fit(). Is this possible? Shouldn't the model and the predictions be the same in both cases? Thanks. – Rafael Santamaria Aug 04 '21 at 21:48
  • You can `fit()` your whole workflow `glm_final_wf` one time on the training data and then `predict()` from that fitted workflow. That way you estimate both preprocessing and model parameters from the training data and can apply them to new data at prediction time. – Julia Silge Aug 04 '21 at 22:23
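
The manual route described in the comments above (fit the whole workflow once on the training data, predict on the test data, then compute metrics with `event_level = "second"`) could look roughly like the sketch below. It is not taken from the thread: the data are simulated, the predictors `x1` and `x2` are hypothetical, and the recipe is reduced to a single normalization step.

```r
library(tidymodels)

# Simulated stand-in for df1 (hypothetical predictors x1 and x2)
set.seed(123)
n <- 400
df1 <- tibble(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(outcome = factor(
    if_else(x1 + 0.5 * x2 + rnorm(n) > 0, "z_category", "a_category"),
    levels = c("a_category", "z_category")
  ))

splits <- initial_split(df1, strata = outcome, prop = 3/4)
training_set <- training(splits)
test_set <- testing(splits)

glm_final_wf <- workflow() %>%
  add_recipe(
    recipe(outcome ~ ., data = training_set) %>%
      step_normalize(all_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Estimate the recipe and the model together, on the training data only
glm_fit <- fit(glm_final_wf, data = training_set)

# The fitted workflow applies the trained recipe to new data at predict time
test_pred <- bind_cols(
  test_set,
  predict(glm_fit, new_data = test_set),
  predict(glm_fit, new_data = test_set, type = "prob")
)

# Treat the second level ("z_category") as the event in every metric
test_metrics <- bind_rows(
  roc_auc(test_pred, truth = outcome, .pred_z_category, event_level = "second"),
  sens(test_pred, truth = outcome, estimate = .pred_class, event_level = "second"),
  yardstick::spec(test_pred, truth = outcome, estimate = .pred_class, event_level = "second"),
  ppv(test_pred, truth = outcome, estimate = .pred_class, event_level = "second"),
  npv(test_pred, truth = outcome, estimate = .pred_class, event_level = "second")
)
test_metrics
```

Because the workflow bundles the recipe with the model, there is no separate prep() or bake() step here; the preprocessing parameters come from the training data and are reused when predicting on the test set.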

One option is to set the global option so that the second level is treated as the event:

Pre 0.0.7

options(yardstick.event_first = FALSE)

Post 0.0.7:

options(yardstick.event_level = 'second')

Ryan John