Why does my XGBoost model have good accuracy on the training and testing datasets, but poor accuracy on a held-out dataset?

Question

I'm currently working on an XGBoost regression model to predict ticket bookings. The model has good accuracy on the training set (around 96%) and on the testing set (around 94%), but when I use it to predict bookings on a separate held-out dataset, the accuracy drops to 82%. I tried moving some rows from my testing set into the held-out set, and the accuracy on them is still poor, even though the model predicts those same rows well when they are in my testing set. I assume I'm doing something wrong but I can't figure out what. Any help would be appreciated, thanks.

Here's the XGBoost model part of my code:

import xgboost as xgb
from sklearn.metrics import mean_squared_error

X_conso, y_conso = data_conso2.iloc[:,:-1],data_conso2.iloc[:,-1]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_conso, y_conso, test_size=0.3, random_state=20)

d_train = xgb.DMatrix(X_train, label = y_train)
d_test = xgb.DMatrix(X_test, label = y_test)
d_fcst_held_out = xgb.DMatrix(X_fcst_held_out)


params = {'p_colsample_bytree_conso' : 0.9, 
          'p_colsample_bylevel_conso': 0.9,
          'p_colsample_bynode_conso': 0.9,
          'p_learning_rate_conso': 0.3,
          'p_max_depth_conso': 10,
          'p_alpha_conso': 3,
          'p_n_estimators_conso': 10,
          'p_gamma_conso': 0.8}

steps = 100

watchlist = [(d_train, 'train'), (d_test, 'test')]
model = xgb.train(params, d_train, steps, watchlist, early_stopping_rounds = 50)

preds_train = model.predict(d_train)
preds_test = model.predict(d_test)
preds_fcst = model.predict(d_fcst_held_out)

And my error levels:

Error train: 4.524787%
Error test: 5.978759%
Error fcst: 18.008451%

score 0 · Answer 1 · answered Jan 07 '22 at 07:05

This is generally normal: accuracy on unseen data is usually lower than on the data used during training.
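To compare the three sets on equal footing, it helps to score each one with the same function. The question does not say how its error percentages were computed, so the sketch below assumes a mean absolute percentage error; the function name and the toy numbers are hypothetical, not taken from the question.

```python
def mean_abs_pct_error(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumption: this may not be the metric used in the question.
    Rows whose true value is zero are skipped to avoid dividing by zero.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("no non-zero targets to score")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Hypothetical toy values standing in for real targets and predictions.
splits = {
    "train": ([100, 200], [95, 210]),
    "held-out": ([100, 200], [80, 240]),
}

# Score every split with the same function so the numbers are comparable.
errors = {name: mean_abs_pct_error(y, p) for name, (y, p) in splits.items()}
for name, err in errors.items():
    print(f"Error {name}: {err:.2f}%")
```

Scoring all splits with one function rules out a difference in how the error is calculated as the source of the gap.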

To improve accuracy on the held-out data, you can tune your hyperparameters, for example with Optuna.

answered Jan 07 '22 at 07:05

ferdy
