
I'm trying to run my first LASSO model and running into a few issues. I have a medical dataset where I am trying to predict a dichotomous outcome (the disease) from about 60 predictors. I get as far as tuning the grid before I get the error "All columns selected for the step should be numeric", despite having already converted them all to dummy variables during the recipe stage. I have reduced the number of predictors to see if that changes anything, but it doesn't fix it. The outcome is uncommon, seen in about 3% of cases, so I don't know whether this is affecting anything. Code as follows.

Splitting into testing and training data and stratifying by disease

set.seed(123)
df_split <- initial_split(df, strata = disease)
df_train <- training(df_split)
df_test <- testing(df_split)

Creating validation set

set.seed(234)
validation_set <- validation_split(df_train,
                                   strata = dfPyVAN,
                                   prop = 0.8)

Building the model

df_model <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

Creating the recipe

df_recipe <- 
  recipe(dfPyVAN ~ ., data = df_train) %>% 
  step_medianimpute(all_predictors()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

Create workflow

df_workflow <- 
  workflow() %>% 
  add_model(df_model) %>% 
  add_recipe(df_recipe)

Grid of penalty values to tune

df_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

Train and tune the model - this is the step where it breaks down and I get the constant error

df_res <- 
  df_workflow %>% 
  tune_grid(validation_set,
            grid = df_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

I have tried multiple variations with the same result and would be very grateful if anyone could offer any help.

Many thanks

Phil
Ryan

1 Answer


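The error you are getting comes from step_medianimpute(). It requires every selected column to be numeric, but all_predictors() also selects your factor variables. Recipe steps run in the order they are written, so at that point step_dummy() has not yet converted the factors to dummy variables.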
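
If it helps to confirm which predictors would trip the imputation step, a quick check along these lines lists the non-numeric columns. This is only a minimal sketch in base R: the data frame toy and its columns are made up to stand in for your df_train.

```r
# Hypothetical toy data standing in for df_train (not the real dataset)
toy <- data.frame(
  age     = c(34, NA, 58, 41),
  sex     = factor(c("f", "m", "m", "f")),
  smoker  = factor(c("no", "yes", "no", "no")),
  disease = factor(c("no", "no", "yes", "no"))
)

# Every column except the outcome is a predictor
predictors <- setdiff(names(toy), "disease")

# Keep the names of predictors that are not numeric; these are the
# columns a median imputation step cannot handle
is_num <- vapply(toy[predictors], is.numeric, logical(1))
non_numeric <- predictors[!is_num]
non_numeric
```

Any column listed here is still a factor (or character) when the first recipe step runs, whatever a later step_dummy() does to it.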

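One way to fix this problem is to rearrange your recipe so that the dummy variables are created before you impute. In the example below, the first recipe reproduces the error and the second, with the steps reordered, preps without it.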

library(recipes)
library(modeldata)
data(ames)

df_recipe <- 
  recipe(Central_Air ~ ., data = ames) %>% 
  step_medianimpute(all_predictors()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

prep(df_recipe)
#> Error: All columns selected for the step should be numeric

df_recipe <- 
  recipe(Central_Air ~ ., data = ames) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_medianimpute(all_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())

prep(df_recipe)
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         73
#> 
#> Training data contained 2930 data points and no missing data.
#> 
#> Operations:
#> 
#> Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, ... [trained]
#> Median Imputation for Lot_Frontage, Lot_Area, ... [trained]
#> Zero variance filter removed 2 items [trained]
#> Centering and scaling for Lot_Frontage, Lot_Area, ... [trained]

Created on 2021-04-27 by the reprex package (v1.0.0)
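
Another option, offered as a sketch and not something taken from your setup, is to keep the original step order but restrict the imputation to numeric predictors, so the factors never reach the median step. It assumes a recipes version recent enough to have all_numeric_predictors(), all_nominal_predictors() and step_impute_median() (the newer name for step_medianimpute()); on an older version, swap the old names back in. Note that missing values in factor columns are left unimputed with this approach.

```r
library(recipes)
library(modeldata)
data(ames)

alt_recipe <-
  recipe(Central_Air ~ ., data = ames) %>%
  # Impute numeric predictors only, so factor columns are never selected here
  step_impute_median(all_numeric_predictors()) %>%
  # Convert the remaining factor predictors to dummy variables
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

alt_prepped <- prep(alt_recipe)

# Processed training data; every predictor should now be numeric
baked <- bake(alt_prepped, new_data = NULL)
predictor_cols <- setdiff(names(baked), "Central_Air")
all_numeric_after <- all(vapply(baked[predictor_cols], is.numeric, logical(1)))
```

Which of the two orderings suits you better depends on whether your factor predictors contain missing values, since only the dummy-first version passes those columns through the imputation step.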

EmilHvitfeldt