I'm trying to run my first LASSO model and am running into a few issues. I have a medical dataset in which I am trying to predict a dichotomous outcome (the disease) from about 60 predictors. I get as far as tuning over the grid before I get the error "All columns selected for the step should be numeric", despite having already converted them all to dummy variables during the recipe stage. I have reduced the number of predictors to see whether that changes anything, but it doesn't fix it. The outcome is uncommon, seen in about 3% of cases, so I don't know whether this is affecting anything. Code as follows:
Splitting into testing and training data and stratifying by disease
set.seed(123)
df_split <- initial_split(df, strata = dfPyVAN)
df_train <- training(df_split)
df_test <- testing(df_split)
Creating validation set
set.seed(234)
validation_set <- validation_split(df_train,
                                   strata = dfPyVAN,
                                   prop = 0.8)
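Since the outcome is rare, I also worked out roughly how many positive cases the validation set can be expected to hold. This is only a back-of-the-envelope sketch: `n_total` is a hypothetical row count (I have not stated the real one here), the 0.75 is the default `prop` of `initial_split()`, and the 0.8 and the 3% come from the code and description above.

```r
# Hypothetical total number of rows; substitute the real nrow(df).
n_total    <- 1000
prevalence <- 0.03   # outcome seen in about 3% of cases

# initial_split() keeps 3/4 of the rows for training by default.
n_train <- n_total * 0.75

# validation_split(prop = 0.8) holds out the remaining 20% of the training rows.
n_validation <- n_train * (1 - 0.8)

# Expected number of positive cases in the validation set.
expected_events <- n_validation * prevalence
print(expected_events)
```

With a row count in that range, the validation set would contain only a handful of positive cases, so roc_auc would be estimated from very few events.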
Building the model
df_model <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
Creating the recipe
df_recipe <-
  recipe(dfPyVAN ~ ., data = df_train) %>%
  step_medianimpute(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
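To narrow down which columns the error could be referring to, a check along these lines lists the predictors that are not numeric in the raw data, before any recipe step has run. It is a minimal base R sketch: `toy_df` and its columns `age` and `sex` are hypothetical stand-ins for `df_train`, and only the outcome name `dfPyVAN` comes from my code.

```r
# Hypothetical stand-in for df_train; the real columns are not shown in this post.
toy_df <- data.frame(
  dfPyVAN = factor(c("no", "no", "yes", "no")),
  age     = c(54, 61, NA, 47),
  sex     = factor(c("F", "M", "M", "F"))
)

# Predictors are every column except the outcome.
predictors <- setdiff(names(toy_df), "dfPyVAN")

# Keep the predictors that are not numeric; a step that only accepts
# numeric columns would reject these if it selected them.
is_num      <- vapply(toy_df[predictors], is.numeric, logical(1))
non_numeric <- predictors[!is_num]
print(non_numeric)
```

Any column this reports is still a factor or character at the point where the first recipe step sees it, since the steps run in the order they are written.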
Create workflow
df_workflow <-
  workflow() %>%
  add_model(df_model) %>%
  add_recipe(df_recipe)
Grid of penalty values to tune
df_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
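To rule out the grid itself, here is a quick base R check of what that sequence contains. It rebuilds the same values as a plain numeric vector (the name `penalty` mirrors the tibble column; `log_steps` is just a helper name for this sketch) and confirms that they are evenly spaced on the log10 scale.

```r
# Same penalty sequence as in the grid above, as a plain numeric vector.
penalty <- 10^seq(-4, -1, length.out = 30)

length(penalty)   # number of candidate values
range(penalty)    # smallest and largest penalty

# Steps on the log10 scale; a constant step means the grid is log-spaced.
log_steps <- diff(log10(penalty))
all(abs(log_steps - log_steps[1]) < 1e-12)
```

All values are numeric and positive, so the grid does not look like the source of the "should be numeric" message.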
Train and tune the model. This is the step where it breaks down, and I get the same error every time
df_res <-
  df_workflow %>%
  tune_grid(validation_set,
            grid = df_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
I have tried multiple variations with the same result. I would be very grateful if anyone could offer any help.
Many thanks