I have a training dataset with 3961 distinct rows and 32 columns that I want to fit with a Random Forest and a Gradient Boosting model. While training, I need to tune the hyperparameters of the models to get the best AUC possible. To do so, I minimize the quantity 1-AUC(Y_real,Y_pred) using SciPy's basin-hopping algorithm. The AUC is computed on the same data used for fitting, so my training and internal validation subsamples are the same.

When the optimization finishes, I get AUC=0.994 for Random Forest and AUC=1 for Gradient Boosting. Am I overfitting these models? How can I tell whether overfitting is taking place during training?

  • Is this for your test or train dataset? – Hamish Gibson Jan 19 '21 at 18:34
  • Have you checked the test/validation accuracy and compared it to training accuracy? Overfitting means that your model is not generalizable to unseen data. Only way to check if model is overfitting is to train a model and compare its results on unseen data. – Akshay Sehgal Jan 19 '21 at 18:35
  • I train the models on the data and compute the AUC on the same training data. I then optimize the hyperparameters by minimizing 1-AUC, where the AUC is computed from the training labels and the predictions on that same training data. – Ernesto Lopez Fune Jan 19 '21 at 18:44

1 Answer

To know whether you are overfitting, you have to compute:

  • Training set accuracy (or 1-AUC in your case)
  • Test set accuracy (or 1-AUC in your case). You can use a validation set instead if you have one.

Once you have calculated these scores, compare them. If the training set score is much better than the test set score, you are overfitting. This means that your model is "memorizing" the training data instead of learning patterns that carry over to future predictions.

You always need this comparison to confirm overfitting. However, a training score that is too perfect (e.g. 100% accuracy, or AUC=1 as in your Gradient Boosting model) is already a strong hint that you are overfitting.

So, if you don't have separate training and test data, you have to create them with sklearn.model_selection.train_test_split. Then you will be able to compare the two scores. Otherwise, you cannot know with confidence whether you are overfitting.
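As a rough illustration of that comparison, here is a minimal sketch. The dataset is synthetic (make_classification is only a stand-in for your 3961 x 32 table), and the 25% test size, the default hyperparameters, and the helper name train_test_auc are my own assumptions, not something taken from your setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data standing in for the real 3961 x 32 table.
X, y = make_classification(
    n_samples=3961, n_features=32, n_informative=8, random_state=0
)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def train_test_auc(model):
    """Fit on the training split and return (train AUC, test AUC)."""
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return auc_train, auc_test


# Default hyperparameters, chosen only to keep the example short.
models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = train_test_auc(model)
    auc_train, auc_test = results[name]
    print(f"{name}: train AUC={auc_train:.3f}, test AUC={auc_test:.3f}, "
          f"gap={auc_train - auc_test:.3f}")
```

A large gap between the two numbers points to overfitting. If you tune hyperparameters, the score you optimize should come from data the model was not fitted on, which is the opposite of what you are doing now.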

Alex Serra Marrugat