I have two datasets, a training set and a test set, and I am fitting an SVM on the training set with the tidymodels package in R. As part of the SVM workflow, I do feature selection to choose the 5 best-performing features. I then try to evaluate this SVM on the test set. However, I get a "The following required columns are missing" error when I try to predict classes for the test set, even though the variables in the test set match the model predictors.
Note that I do the feature selection with step_select_roc, where top_p = 5 keeps the 5 best-performing features. I have created a reproducible example:
library(tidymodels)
#remotes::install_github("stevenpawley/recipeselectors")
library(recipeselectors)
library(mlbench)
data(Ionosphere)
# preprocess dataset
Ionosphere <- Ionosphere %>% select(-V1, -V2)
# split into training and test data
ion_split <- initial_split(Ionosphere, prop = 3/5)
ion_train <- training(ion_split)
ion_test <- testing(ion_split)
# make a recipe - note the step_select_roc function, which will select the 5 best-performing features
iono_rec <-
  recipe(Class ~ ., data = ion_train) %>%
  step_zv(all_predictors()) %>%
  step_lincomb(all_numeric()) %>%
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)
# build the model and workflow
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
svm_workflow <-
  workflow() %>%
  add_recipe(iono_rec) %>%
  add_model(svm_mod)
# run model tuning
set.seed(35)
recipe_res <-
  svm_workflow %>%
  tune_grid(
    resamples = bootstraps(ion_train, times = 2),
    metrics = metric_set(roc_auc),
    control = control_grid(verbose = TRUE, save_pred = TRUE)
  )
# choose best model, finalise workflow
best_mod <- recipe_res %>% select_best(metric = "roc_auc")
final_wf <- finalize_workflow(svm_workflow, best_mod)
final_mod <- final_wf %>% fit(ion_train)
At this stage, I can use pull_workflow_mold to see that there are only 5 predictor variables:
pull_workflow_mold(final_mod)$predictors
# A tibble: 211 x 5
        V3     V7    V27     V31     V33
     <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
 1  0.995   0.834  0.411  0.423   0.186
 2  1      -0.109 -0.205 -0.166  -0.137
 3  1       1      0.590  0.604   0.560
 4  0.976   0.928  0.137 -0.0426 -0.138
 5  0.964   1      0.576  0.451   0.389
 6 -0.0186  0      0.206  0.166  -0.0821
 7  1       1      1      1       1
 8  1       1.00   0.762  0.687   0.647
 9  1       0.855  1      1       1
10  1       1      1      1       1
# … with 201 more rows
Now, if I subset my test data to only the predictors in the model and then try to use predict, I get an error:
ion_test <- testing(ion_split) %>% select(V3, V7, V27, V31, V33)
predict_res <- predict(
  final_mod,
  ion_test,
  type = "prob"
)
Error: The following required columns are missing: 'V4', 'V5', 'V6', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V28', 'V29', 'V30', 'V32', 'V34'.
Can someone please advise why this problem is happening, and how to avoid it? Thank you.
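A possible explanation, offered as a sketch rather than a confirmed diagnosis: a fitted workflow re-runs the recipe on whatever is passed to predict, so the new data may need the raw columns the recipe was trained on, not just the columns that survive feature selection. The toy example below tries to reproduce that behaviour without recipeselectors or kernlab. Everything in it is an assumption made for illustration: mtcars stands in for Ionosphere, step_rm() stands in for step_select_roc(), a linear model stands in for the SVM, and extract_mold() is the newer name for pull_workflow_mold() in recent workflows releases.

```r
library(tidymodels)

set.seed(35)
car_split <- initial_split(mtcars, prop = 3/5)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Stand-in for step_select_roc(): drop every predictor except hp and wt
car_rec <-
  recipe(mpg ~ ., data = car_train) %>%
  step_rm(cyl, disp, drat, qsec, vs, am, gear, carb)

car_wf <-
  workflow() %>%
  add_recipe(car_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

car_fit <- fit(car_wf, car_train)

mold <- extract_mold(car_fit)
# Columns the model sees after the recipe has been applied
processed_cols <- names(mold$predictors)
# Raw predictor columns recorded in the blueprint at fit time
raw_cols <- names(mold$blueprint$ptypes$predictors)

# Predicting with the full, un-subsetted test set: the workflow applies
# the recipe itself, so no manual column selection is needed
preds_full <- predict(car_fit, car_test)

# Predicting with only the selected columns; this is expected to fail with a
# "required columns are missing" style error, assuming the installed hardhat
# validates new data against the raw predictor columns
subset_res <- tryCatch(
  predict(car_fit, car_test %>% select(hp, wt)),
  error = function(e) conditionMessage(e)
)
```

If the same reasoning applies to the workflow in the question, passing testing(ion_split) to predict without the select() call would be the thing to try, since the recipe inside final_mod should then pick out V3, V7, V27, V31 and V33 on its own.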