
I have optimized some algorithms (in mlr3) on a validation set:

  • random forest
  • xgboost
  • svm

I have extracted the balanced accuracy of each algorithm, but I'd like to know whether it is possible to get p-values for each prediction versus a random prediction (a featureless learner)... With a Friedman test I can compute this via benchmarking, but only on my test set... Here is my code:

#Auto tuning Ranger
learner_ranger = lrn("classif.ranger",
  predict_type = "prob",
  num.trees = to_tune(1, 2000),
  mtry.ratio = to_tune(0, 1),
  sample.fraction = to_tune(1e-1, 1),
  importance = "impurity"
)

set.seed(1234)
at_ranger = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_ranger,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning knn
learner_kknn = lrn("classif.kknn", predict_type = "prob", k = to_tune(1, 30))

at_kknn = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_kknn,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning xgboost
learner_xgboost = lrn("classif.xgboost",
  predict_type = "prob",
  nrounds = to_tune(1, 5000),
  eta = to_tune(1e-4, 1, logscale = TRUE),
  subsample = to_tune(0.1, 1),
  max_depth = to_tune(1, 15),
  min_child_weight = to_tune(0, 7),
  colsample_bytree = to_tune(0, 1),
  colsample_bylevel = to_tune(0, 1),
  lambda = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha = to_tune(1e-3, 1e3, logscale = TRUE)
)

at_xgboost = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_xgboost,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning rpart
learner_rpart = lrn("classif.rpart",
  cp = to_tune(0.0001, 1),
  minsplit = to_tune(1, 60),
  maxdepth = to_tune(1, 30),
  minbucket = to_tune(1, 60),
  predict_type = "prob"
)

at_rpart = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_rpart,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning svm
learner_svm = lrn("classif.svm",
  type = "C-classification",
  cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE)),
  gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE)),
  kernel = to_tune(c("polynomial", "radial")),
  degree = to_tune(1, 4),
  predict_type = "prob"
)

at_svm = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_svm,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

For example, with XGBoost:

learners = c(lrn("classif.featureless"), at_xgboost)

set.seed(1234)
design = benchmark_grid(
  tasks = list(task, task_imp, task_Emb, task_top, task_Pearson, task_IG), 
  learners = learners,
  resamplings = resampling_outer)
bmr = benchmark(design, store_models = TRUE)

results_xgboost <- bmr$aggregate(measures)
print(results_xgboost)

archives = extract_inner_tuning_archives(bmr)
inner_learners = map(archives$resample_result, "learners")  # map() from purrr (or mlr3misc)

best_bacc_xgboost <- aggregate(classif.bacc ~ task_id, data = archives, max)

I have extracted my best balanced accuracy result here, but I would like to know, in comparison with my feature selection technique, whether there is a significant difference or whether it is just chance.

I found the beginning of an answer in this topic: https://stats.stackexchange.com/questions/368176/how-to-determine-whether-a-classifier-is-significantly-better-than-random-guessi

"To assess whether one classifier is better than a "benchmark" one (e.g., one that always assigns a probability of 1 to the majority class and a probability of 0 to all others), you need to derive the distribution of the score of the benchmark. You can get this by bootstrapping: resample the test set, apply the benchmark to the resampled set, note the score. Repeat this many times. You now have the distribution you are looking for. Check how many of these resampled scores are smaller (since scores are better when smaller) than the score of your candidate classifier."

The big problem is: to create my test set, I manually customized the partition between the train, validation and test sets...

Because of this, I will quickly be limited in terms of resampling. My raw data are quite particular (I have patients, and each patient has several lesions, which makes it necessary to split the data so that the lesions belonging to the same patient stay together...).
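
As an aside on this constraint: if a patient identifier is available as a column, mlr3 can enforce "all lesions of a patient end up on the same side of a split" automatically through the group column role, which makes standard resamplings such as cross-validation usable without hand-built splits. A sketch, assuming a hypothetical patient_id column in data:

library(mlr3)

# assumes `data` contains the target "LesionResponse" and a hypothetical
# "patient_id" column identifying which patient each lesion belongs to
task_grouped = as_task_classif(data, target = "LesionResponse")

# declare patient_id as the grouping variable: resampling will never split
# lesions of the same patient between train and test sets
task_grouped$set_col_roles("patient_id", roles = "group")

# any standard resampling now respects the grouping, e.g. 5-fold CV
resampling_grouped = rsmp("cv", folds = 5)
resampling_grouped$instantiate(task_grouped)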

EDIT: Based on Lars's response, I will add the split flow of my work: [figure: diagram of the train/validation/test split]

I have created three sets (train, validation, test) by splitting the data with a customized resampling:

task = as_task_classif(data, target = "LesionResponse")

#Creation of the OUTER resampling via customization
resampling_outer = rsmp("custom")
resampling_outer$instantiate(task, train = list(train_rows_outer), test = list(test_rows))

#Creation of the INNER resampling via customization
resampling_inner = rsmp("custom")
resampling_inner$instantiate(task, train = list(train_rows), test = list(valid_rows))

# train_rows and valid_rows split my training set in two for the optimization of the models

# train_rows_outer and test_rows split my full data set in two: the outer training set and the test set

EDIT: Here is my strategy following the benchmarking of multiple learners. I obtained these results on the validation set.

The best results were obtained with XGBoost (absolutely no surprise here), but "task_importance" was subject to overfitting, as you can see, so I checked and plotted the results on the test set, and it confirmed that this feature selection technique was overfitting on the validation set... [figure: results on the test set]

resample_result_importance <- bmr$resample_results$resample_result[[6]]
resample_result_embarquee <- bmr$resample_results$resample_result[[4]]

prediction_xgboost_importance <- resample_result_importance$prediction()
prediction_xgboost_embarquee <- resample_result_embarquee$prediction()

pred_importance <- as.data.table(prediction_xgboost_importance)
pred_embarquee <- as.data.table(prediction_xgboost_embarquee)

levels(pred_importance$response) <- levels(pred_importance$truth)
levels(pred_embarquee$response) <- levels(pred_embarquee$truth)

# calculate confusion matrices
cm_importance <- mlr3measures::confusion_matrix(truth = pred_importance$truth, response = pred_importance$response, positive = "0")
cm_embarquee <- mlr3measures::confusion_matrix(truth = pred_embarquee$truth, response = pred_embarquee$response, positive = "0")

# print confusion matrices
print(cm_importance)
print(cm_embarquee)

# calculate AUC
auc_score_importance <- mlr3measures::auc(truth = pred_importance$truth, prob = pred_importance$`prob.0`, positive = "0")
auc_score_embarquee <- mlr3measures::auc(truth = pred_embarquee$truth, prob = pred_embarquee$`prob.0`, positive = "0")

# print AUC
print(auc_score_importance)
print(auc_score_embarquee)

# calculate p-values and confidence intervals
# (roc(), roc.test() and ci.auc() below come from the pROC package)
library(pROC)
roc_obj_importance <- roc(pred_importance$truth, pred_importance$prob.1)
roc_obj_embarquee <- roc(pred_embarquee$truth, pred_embarquee$prob.1)

random_preds_importance <- rep(1, length(pred_importance$truth))
random_preds_embarquee <- rep(1, length(pred_embarquee$truth))

roc_obj_random_importance <- roc(pred_importance$truth, random_preds_importance)
roc_obj_random_embarquee <- roc(pred_embarquee$truth, random_preds_embarquee)

roc_test_result_importance <- roc.test(roc_obj_importance, roc_obj_random_importance)
roc_test_result_embarquee <- roc.test(roc_obj_embarquee, roc_obj_random_embarquee)

p_value_importance <- roc_test_result_importance$p.value
p_value_embarquee <- roc_test_result_embarquee$p.value

auc_ci_importance <- ci.auc(roc_obj_importance, method = "delong")
auc_ci_embarquee <- ci.auc(roc_obj_embarquee, method = "delong")

print(p_value_importance)
print(p_value_embarquee)

print(auc_ci_importance)
print(auc_ci_embarquee)

rocvals_importance <- data.frame(
  threshold = unique(pred_importance$`prob.1`),
  TPR = sapply(unique(pred_importance$`prob.1`), function(x) sum(pred_importance$truth[pred_importance$`prob.1` >= x] == "1") / sum(pred_importance$truth == "1")),
  FPR = sapply(unique(pred_importance$`prob.1`), function(x) sum(pred_importance$truth[pred_importance$`prob.1` >= x] == "0") / sum(pred_importance$truth == "0")),
  model = "importance"
)

rocvals_embarquee <- data.frame(
  threshold = unique(pred_embarquee$`prob.1`),
  TPR = sapply(unique(pred_embarquee$`prob.1`), function(x) sum(pred_embarquee$truth[pred_embarquee$`prob.1` >= x] == "1") / sum(pred_embarquee$truth == "1")),
  FPR = sapply(unique(pred_embarquee$`prob.1`), function(x) sum(pred_embarquee$truth[pred_embarquee$`prob.1` >= x] == "0") / sum(pred_embarquee$truth == "0")),
  model = "embarquee"
)

rocvals <- rbind(rocvals_importance, rocvals_embarquee)

caption_text <- paste(
  "Model importance: AUC = ", round(auc_score_importance, 3), 
  ", CI = [", round(auc_ci_importance[1], 3), ", ", round(auc_ci_importance[2], 3), "], p-value = ", round(p_value_importance, 3),
  "\nModel embarquee: AUC = ", round(auc_score_embarquee, 3),
  ", CI = [", round(auc_ci_embarquee[1], 3), ", ", round(auc_ci_embarquee[2], 3), "], p-value = ", round(p_value_embarquee, 3)
)

auc_labels <- data.frame(
  model = c("embarquee", "importance"),
  AUC = c(auc_score_embarquee, auc_score_importance),
  x = c(0.7, 0.7),
  y = c(0.5, 0.4)
)

# plot ROC curves
p <- ggplot(rocvals, aes(FPR, TPR, color = model)) +
  geom_point() +
  geom_line() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_text(data = auc_labels, aes(x = x, y = y, label = paste("AUC =", round(AUC, 3)), color = model), hjust = 1, size = 4) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(
    title = "ROC Curves for XGBoost on Test Data",
    x = "1 - Specificity (FPR)",
    y = "Sensitivity (TPR)",
    color = "Model",
    caption = caption_text  # add the caption
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5, color = "black"),
    axis.title.x = element_text(size = 12, face = "bold", color = "black"),
    axis.title.y = element_text(size = 12, face = "bold", color = "black"),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_line(color = "grey"),
    plot.caption = element_text(hjust = 0, face= "italic")  # style for the caption
  )

print(p)
(asked by NDe)

1 Answer

mlr3 supports statistical tests that can be run directly on benchmark results, see the relevant section in the mlr3 book. This will also automatically correct for multiple testing, which doesn't happen if you extract p-values manually.
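
A minimal sketch of what this can look like with the mlr3benchmark package, applied to the bmr object from the question (function names as I understand the current mlr3benchmark API; the book chapter is the authoritative reference):

library(mlr3benchmark)

# aggregate the BenchmarkResult to one score per task/learner combination
bma = as_benchmark_aggr(bmr, measures = msr("classif.bacc"))

# global Friedman test: do the learners differ across the tasks at all?
bma$friedman_test()

# with more than two learners in the design, pairwise post-hoc comparisons
# with multiple-testing correction (e.g. Holm) are available as well
bma$friedman_posthoc(p.adjust.method = "holm")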

To get a bootstrapping estimate as you describe, all you need to do is set the resampling method for the benchmark design to bootstrapping with the desired number of iterations.
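
For example, something along these lines (a sketch only; note that a plain bootstrap does not respect the patient/lesion grouping described in the question unless that is handled separately, e.g. via the task's group column role):

# same design as in the question, but resampled by bootstrapping, so each
# learner is scored on many resampled test sets
design_boot = benchmark_grid(
  tasks = list(task),
  learners = c(lrn("classif.featureless"), at_xgboost),
  resamplings = rsmp("bootstrap", repeats = 100)
)
bmr_boot = benchmark(design_boot, store_models = FALSE)
bmr_boot$aggregate(msr("classif.bacc"))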

(answered by Lars Kotthoff)

  • Thanks for the answer, but I don't want the p-values for the benchmarking, since that is designed to compare my algorithms on my test set. I'm trying to find out the p-values of my optimized algorithms on my validation set, since I want: 1) to compare them; 2) to assess whether the results are significantly different from random predictions... – NDe Jun 13 '23 at 19:46
  • You can do this with a [custom resampling](https://mlr3book.mlr-org.com/performance.html#sec-resamp-custom). – Lars Kotthoff Jun 14 '23 at 08:34
  • I'll edit my post with your response and an explanation of my custom resampling, but I don't understand what you mean by this...How would you do this ? – NDe Jun 14 '23 at 09:57
  • You would define your existing splits as the custom resampling. Then this would be automatically picked up by the benchmarking (which you give your custom resampling to), which would do the statistical analysis on the predictions on the validation set. – Lars Kotthoff Jun 14 '23 at 10:31
  • When I look in the dictionary or in the mlr3 gallery, I don't find the code/object/function to extract such a result. If you look at my benchmarking, how would you do this? (Thank you for your patience Lars, you're a great help.) – NDe Jun 14 '23 at 12:10
  • So you can either rely on the built-in functions to do the statistical comparisons, or extract the predictions from the benchmark result to run your own statistical test. If you do the latter, you have to take care to correct for multiple testing though, i.e. the p values can't be interpreted in the same way as if you compare only two things. – Lars Kotthoff Jun 15 '23 at 07:39
  • I only compare the best result of my tuning to a random prediction, so I don't think I'll have to correct. I succeeded in creating a comparison set with featureless-like results. The p-value of the comparison with my actual results will be my final p-value for this particular model. – NDe Jun 15 '23 at 10:01
  • Ok, then I misunderstood what you're trying to do -- sounds like this would be simply making predictions with the tuned learner on the validation set? – Lars Kotthoff Jun 15 '23 at 17:32
  • yes you're right ! The goal is to compare my different models on the validation set in order to select the "best one" to test on the independent set. If I benchmark all the models, then select the best, it would be cheating... – NDe Jun 15 '23 at 19:31
  • So you essentially want to tune to find the best model? You can also do that directly with mlr3, see https://mlr3book.mlr-org.com/optimization.html – Lars Kotthoff Jun 16 '23 at 07:50
  • In fact, I wondered if there was a solution in mlr3 to 1) optimize multiple learners on my validation set with their predictions on this same set (I found how to do it with the resample result) and 2) extract a p-value to see if the predictions differ from a random prediction in comparison with the ground truth... I did it manually by extracting the predictions with a confidence interval using the pROC package. – NDe Jun 16 '23 at 08:23
  • Well, what you can do quite easily is optimize the learners and then compare predictions of the optimized learner to a featureless one on a validation set (and compute p-values if you like). It's not really necessary to compare each model during the optimization process to a baseline because only the best one counts in the end. – Lars Kotthoff Jun 16 '23 at 08:58
  • Yes, I understand, but you put your finger exactly where I'm stuck... because this part: "optimize the learners and then compare predictions of the optimized learner to a featureless one on a validation set" means, to me, that I must create a training set, a validation set (for optimization), another validation set (to make predictions and benchmark), and then, finally, a test set to estimate the generalization capacity of the best model selected in the previous step... There is no automated solution, I mean; I must extract the prediction tables for each model... – NDe Jun 16 '23 at 09:30
  • You can simply do a nested resampling for the optimization. This will give you unbiased generalization performance estimates. The only additional step you'd have to do is to compute p-values, for which it's easiest to make predictions on a separate data set (that was not included in the nested resampling). – Lars Kotthoff Jun 16 '23 at 09:45
  • You could also create a custom measure that computes p-values and use that at every step along this process. – Lars Kotthoff Jun 16 '23 at 09:47
  • Finally, I found a solution for my problem: 1) nested resampling with customized resampling; 2) extraction of predictions on the validation set (inner resampling), then selection of the best results for each learner and task; 3) computing a p-value for each AUC by creating a random prediction (featureless-like) and comparing it with the actual prediction and the ground truth; 4) after the selection on the validation set, I check the results of the selected model on the test set (I did a nested resampling, so this corresponds to the prediction with the outer customized resampling). I have edited my post above. – NDe Jun 16 '23 at 13:12