How to make predictions in tidymodels R when feature selection has been applied to the model

Question

I have two datasets, a training and a test dataset, and I am creating an SVM from the training dataset with the tidymodels package in R. As part of the SVM workflow, I am doing feature selection to choose the 5 best-performing features. I am then trying to test this SVM on the test dataset. However, I get a "The following required columns are missing" error when I try to predict classifications for the test dataset, even though the variables in the test dataset match the model predictors.

Note that I do the feature selection with step_select_roc, where top_p selects the 5 best-performing features. I have created a reproducible example:

library(tidymodels)
#remotes::install_github("stevenpawley/recipesSelection")
library(recipeselectors)

library(mlbench)
data(Ionosphere)

# preprocess dataset
Ionosphere <- Ionosphere %>% select(-V1, -V2)

# split into training and test data
ion_split <- initial_split(Ionosphere, prop = 3/5)

ion_train <- training(ion_split)
ion_test <- testing(ion_split) 

# make a recipe - note the step_select_roc function, which will select the 5
# best-performing features
iono_rec <-
  recipe(Class ~ ., data = ion_train)  %>%
  step_zv(all_predictors()) %>% 
  step_lincomb(all_numeric()) %>%
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)

# build the model and workflow
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_workflow <-
  workflow() %>%
  add_recipe(iono_rec) %>%
  add_model(svm_mod)

# run model tuning
set.seed(35)
recipe_res <-
  svm_workflow %>% 
  tune_grid(
    resamples = bootstraps(ion_train, times = 2),
    metrics = metric_set(roc_auc),
    control = control_grid(verbose = TRUE, save_pred = TRUE)
  )

# choose the best model, finalise the workflow
# (naming the argument works with both older and newer versions of tune)
best_mod <- recipe_res %>% select_best(metric = "roc_auc")
final_wf <- finalize_workflow(svm_workflow, best_mod)
final_mod <- final_wf %>% fit(ion_train)

At this stage, I can do pull_workflow_mold to see that there are only 5 predictor variables:

pull_workflow_mold(final_mod)$predictor
# A tibble: 211 x 5
        V3     V7    V27     V31     V33
     <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
 1  0.995   0.834  0.411  0.423   0.186 
 2  1      -0.109 -0.205 -0.166  -0.137 
 3  1       1      0.590  0.604   0.560 
 4  0.976   0.928  0.137 -0.0426 -0.138 
 5  0.964   1      0.576  0.451   0.389 
 6 -0.0186  0      0.206  0.166  -0.0821
 7  1       1      1      1       1     
 8  1       1.00   0.762  0.687   0.647 
 9  1       0.855  1      1       1     
10  1       1      1      1       1     
# … with 201 more rows

Now if I subset my test data to only those predictors in the model, and then try and use predict, I get an error:

ion_test <- testing(ion_split) %>% select(V3, V7, V27, V31, V33)

predict_res <- predict(
        final_mod,
        ion_test,
        type = "prob")
    
Error: The following required columns are missing: 'V4', 'V5', 'V6', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V28', 'V29', 'V30', 'V32', 'V34'.

Can someone please advise why this problem is happening, and how to avoid it? Thank you.

In layman's terms: you are training a model with a set of predictors. All of these predictors are part of the model, even if their "impact" on the model outcome might be low. If you want to predict new cases with this model, you of course need all the variables from the original model in the test data set. What you probably want to do is: 1. Train your larger model. 2. Select the best model with only a subset of predictors. 3. Create a new data set with only this subset of predictors and train a new model. 4. Apply your test data (with only this subset of variables) to this new model. - deschen, Apr 07 '21 at 19:56
The `final_mod` object is a workflow that expects all the predictors that were in `ion_train`; you'll need to pass in data that includes those variables, although the unselected ones will not be used to make predictions. - Julia Silge, Apr 07 '21 at 21:55

scrameri · Answer 1 · 2021-09-14T22:55:23.317

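If you use a tidymodels workflow to fit and predict, new_data must contain the same variables that were passed to the recipe at training time, including the ones that step_select_roc later drops. The fitted workflow applies the recipe, and with it the feature selection, to new_data itself.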

This should fix your issue:

ion_test <- testing(ion_split) ## %>% select(V3, V7, V27, V31, V33) # don't select here!

predict_res <- predict(
        final_mod,
        new_data = ion_test,
        type = "prob")

predict_res
# A tibble: 141 × 2
   .pred_bad .pred_good
       <dbl>      <dbl>
 1    0.0217     0.978 
 2    0.908      0.0917
 3    0.961      0.0391
 4    0.0341     0.966 
 5    0.0641     0.936 
 6    0.957      0.0428
 7    0.0321     0.968 
 8    0.958      0.0424
 9    0.291      0.709 
10    0.0480     0.952 
# … with 131 more rows
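To see why the full set of columns is needed, here is a minimal, self-contained sketch. It is not the code from the question: mtcars, linear_reg() and step_corr() are stand-ins I chose for the Ionosphere data, the SVM and step_select_roc(), so that it runs with CRAN packages only. It compares the columns the model was actually fitted on with the columns the workflow checks for at prediction time.

```r
# Sketch only: mtcars, linear_reg() and step_corr() are illustrative stand-ins
# for the Ionosphere data, the SVM and step_select_roc() from the question.
library(recipes)
library(parsnip)
library(workflows)

train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# step_corr() removes highly correlated predictors, so fewer columns
# reach the model than were handed to the recipe
rec <- recipe(mpg ~ ., data = train) %>%
  step_corr(all_predictors(), threshold = 0.8)

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

mold     <- extract_mold(wf_fit)
kept     <- names(mold$predictors)                   # columns the model was fitted on
expected <- names(mold$blueprint$ptypes$predictors)  # columns checked in new_data

# Passing every original predictor works; the recipe drops the extras itself
preds_full <- predict(wf_fit, new_data = test)

# Passing only the surviving columns fails the column check
reduced_fails <- inherits(
  try(predict(wf_fit, new_data = test[, kept, drop = FALSE]), silent = TRUE),
  "try-error"
)
```

Here `kept` is shorter than `expected`, which mirrors the situation in the question: the mold shows 5 predictors, but predict() still validates new_data against everything the recipe originally received.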

Alternatively, you could repeat the fitting procedure with a recipe that uses only the five selected variables, and then predict on new data that contains just those variables. However, I feel that this goes a bit against the tidy philosophy of tidymodels, although it will give you a smaller object to save on disk.
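A rough sketch of that alternative, again with stand-ins of my own choosing (mtcars, linear_reg() and step_corr() instead of the Ionosphere data, the SVM and step_select_roc()): fit once with the filter step to learn which predictors survive, then refit a second workflow on just those columns.

```r
# Sketch only: the data, model and filter step are illustrative stand-ins.
library(recipes)
library(parsnip)
library(workflows)

train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# 1. Fit with the filter step to find out which predictors survive
filter_rec <- recipe(mpg ~ ., data = train) %>%
  step_corr(all_predictors(), threshold = 0.8)

filter_fit <- workflow() %>%
  add_recipe(filter_rec) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

selected <- names(extract_mold(filter_fit)$predictors)

# 2. Refit on a training set that holds only the outcome and those predictors
train_small <- train[, c("mpg", selected)]

small_fit <- workflow() %>%
  add_recipe(recipe(mpg ~ ., data = train_small)) %>%
  add_model(linear_reg()) %>%
  fit(data = train_small)

# 3. The second workflow only expects the selected columns in new_data
test_small  <- test[, selected, drop = FALSE]
preds_small <- predict(small_fit, new_data = test_small)
```

One caveat with this route: the selection happens outside the second workflow, so any resampling done on `small_fit` no longer accounts for the feature-selection step and may look more optimistic than it should.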

Also, note that I got a warning about deprecated use of pull_* functions in your original code. I replaced

pull_workflow_mold(final_mod)$predictor

with

extract_mold(final_mod)$predictor
# A tibble: 210 × 5
      V3      V4    V5    V7   V27
   <dbl>   <dbl> <dbl> <dbl> <dbl>
 1 0.724 -0.0108 0.797 0.8   0.780
 2 0.599  0.147  0.699 0.851 0.614
 3 0.495  0.0971 0.296 0.350 0.365
 4 0      0      0     0     1    
 5 0.947  0.287  0.726 0.476 0.161
 6 0.923  0.0780 0.927 0.897 0.188
 7 0.675  0.0453 0.770 0.774 0.739
 8 1     -0.0373 1     0.996 0.832
 9 0.749  0.0255 0.990 0.759 0.823
10 0.882 -0.146  0.934 0.921 0.568
# … with 200 more rows

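Also note that a different set of predictors was selected in my run, presumably because initial_split() was called without a seed, so my training set differs from yours.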
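If you want the split, and therefore the selected features, to be repeatable, one option is to set a seed immediately before the split. A small sketch, using mtcars and an arbitrary seed value as placeholders:

```r
# Sketch only: mtcars and the seed value 2021 are arbitrary placeholders.
library(rsample)

set.seed(2021)
split_a <- initial_split(mtcars, prop = 3/5)

set.seed(2021)
split_b <- initial_split(mtcars, prop = 3/5)

# The same seed right before the split gives the same training rows
same_with_seed <- identical(training(split_a), training(split_b))
```

In the code from the question, that would mean calling set.seed() before initial_split() as well as before tune_grid().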

> sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kernlab_0.9-29        vctrs_0.3.8           rlang_0.4.11         
 [4] recipeselectors_0.0.1 mlbench_2.1-3         yardstick_0.0.8      
 [7] workflowsets_0.1.0    workflows_0.2.3       tune_0.1.6           
[10] tidyr_1.1.3           tibble_3.1.4          rsample_0.1.0        
[13] recipes_0.1.16        purrr_0.3.4           parsnip_0.1.7        
[16] modeldata_0.1.1       infer_1.0.0           dplyr_1.0.7          
[19] dials_0.0.10          scales_1.1.1          broom_0.7.9          
[22] tidymodels_0.1.3      ggplot2_3.3.5
