
We have been running 'gbm' models on a dataset of about 15k rows. We implemented 10-fold cross-validation directly to come up with a cross-validated model, which we are then using to predict on the same dataset.

This has resulted in probably overfitted models, with a training AUC of about 0.99 and a cross-validated AUC of about 0.92. The prediction AUC on the same data is also very high, about 0.99.

Reviewers have asked us to validate the model with a holdout dataset. We assume this means splitting the data into a holdout set and a training set, running k-fold cross-validation again on the training set, and then validating the model against the holdout set. My final question: can we then use the validated model on the whole dataset for prediction?
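
Roughly what we have in mind, as a minimal sketch (assuming the caret, gbm and pROC packages, and a data frame `df` with a binary factor outcome `y` whose positive level is `"yes"`; the object names are illustrative, not our actual code):

```r
library(caret)
library(pROC)

set.seed(42)

# 1. Split off a holdout set before any model fitting
idx      <- createDataPartition(df$y, p = 0.8, list = FALSE)
train_df <- df[idx, ]
holdout  <- df[-idx, ]

# 2. 10-fold cross-validation on the training set only
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit  <- train(y ~ ., data = train_df, method = "gbm",
              metric = "ROC", trControl = ctrl, verbose = FALSE)

# 3. Validate on the untouched holdout set
holdout_probs <- predict(fit, newdata = holdout, type = "prob")[, "yes"]
auc(roc(holdout$y, holdout_probs))
```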

  • You can use the validated model on whatever you want, but you should report the performance on the holdout dataset alone as your best estimate of actual model performance. – Gregor Thomas Apr 18 '18 at 02:08

1 Answer


You can... whether you should depends on what you are trying to portray.

Ideally you want to be able to show that your model generalises well to new data (the holdout) and compare that to how the model performs on the training data. If there is a large discrepancy in performance between the two, you have likely overfit the data.

I wouldn't see much point in predicting across all the data (training and holdout) at once, as it doesn't help demonstrate the model's ability to predict on unseen data.

You would aim to report the performance on the training data during k-fold CV, and then the performance on the holdout.

Depending on your k-fold CV setup, you would train the model on the entire training set before predicting on both sets and comparing the results, as in the sketch below. You would need to be more specific in describing your exact setup.
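
For example, with a caret-style setup like the hypothetical sketch in the question (the `fit` and `holdout` objects below are assumed to come from there), `train()` already refits the final model on the entire training set using the best tuning parameters, so you can compare the resampled CV estimate against the holdout estimate directly:

```r
# Assumes `fit` and `holdout` from the caret sketch in the question.
library(caret)
library(pROC)

cv_auc      <- max(fit$results$ROC)   # mean AUC over the 10 CV folds (best tuning row)
holdout_auc <- as.numeric(
  auc(roc(holdout$y, predict(fit, newdata = holdout, type = "prob")[, "yes"]))
)

# A large gap between the two suggests overfitting to the training data
c(cv = cv_auc, holdout = holdout_auc)
```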

zacdav