I've trained an XGBoost classifier (on train_df), tuned it (on valid_df), and tested it (on test_df), and made some non-trivial observations along the way. After running HyperOpt trials, I obtain the following precision scores:
- Model 1: train: 0.16, valid: 0.12, test: 0.12
- Model 2: train: 0.23, valid: 0.18, test: 0.17
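For context, here is a simplified sketch of my pipeline; the synthetic data, reduced search space, and fixed `n_estimators` are stand-ins for my actual features and HyperOpt configuration:

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my data (imbalanced, which is consistent with the low precision)
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95], random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

def objective(params):
    model = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        eval_metric="logloss",
    )
    model.fit(X_train, y_train)
    # HyperOpt minimizes, so return negative validation precision
    return -precision_score(y_valid, model.predict(X_valid), zero_division=0)

space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25,
            trials=trials, rstate=np.random.default_rng(0))

# Refit with the best parameters and report precision on all three splits
final = xgb.XGBClassifier(n_estimators=200, max_depth=int(best["max_depth"]),
                          learning_rate=best["learning_rate"], eval_metric="logloss")
final.fit(X_train, y_train)
for name, (Xs, ys) in {"train": (X_train, y_train),
                       "valid": (X_valid, y_valid),
                       "test": (X_test, y_test)}.items():
    print(name, precision_score(ys, final.predict(Xs), zero_division=0))
```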
Before rushing to conclude that these two models are overfitting, can somebody tell me what's actually happening here and what can be done to resolve it? If this is indeed overfitting, is it something to worry about? I am getting consistent results on valid_df and test_df, and on an out-of-sample (i.e., unseen data) evaluation I get performance consistent with the test_df performance. Since we care only about the test_df metric, as it indicates the real-world performance of the model, do we need to worry about how much the model overfits (if this is overfitting) on train_df? At the end of the day, isn't test_df performance all that matters?