
I am a bit confused: I am training a model that yields circa an 88% CV score on the training data, while the same model performs poorly on the test data after I submit it (score of 0.75). This drop of 12 points can't be all due to overfitting, can it? Any ideas? Have you experienced such a gap in your models/submissions?

See the model and results below.

##########################################################

from xgboost import XGBClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

xgb_clf = XGBClassifier(n_estimators=87, learning_rate=0.05, max_depth=10,
                        colsample_bytree=0.8, n_jobs=-1, random_state=7,
                        scale_pos_weight=0.6, min_child_weight=0.9, gamma=2.1)
skf = RepeatedStratifiedKFold(n_splits=4)

# Note: scoring='roc_auc', so the printed figure is ROC AUC, not accuracy.
results = cross_val_score(xgb_clf, X_train, y, cv=skf, scoring='roc_auc')
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

Accuracy: 88.13% (2.47%)

  • Hi and welcome to the site. This question seems to be less about programming and more about machine learning concepts. In the future, you might consider asking such questions on CrossValidated (https://stats.stackexchange.com/) – fujiu Dec 12 '20 at 10:10

1 Answer


Yes, this absolutely can indicate overfitting. A 12-point difference between training and test accuracy is not unusual. In fact, in extreme cases of overfitting, you might even observe 100% accuracy on the training set and accuracy at chance level on the test data.
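As a quick check, you can ask scikit-learn to return the train-fold scores alongside the validation-fold scores; a train score far above the validation score points to overfitting. Here is a minimal sketch, assuming the same `X_train` and `y` as in your question:

    from xgboost import XGBClassifier
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

    # Same estimator settings as in the question.
    xgb_clf = XGBClassifier(n_estimators=87, learning_rate=0.05, max_depth=10,
                            colsample_bytree=0.8, n_jobs=-1, random_state=7,
                            scale_pos_weight=0.6, min_child_weight=0.9, gamma=2.1)
    skf = RepeatedStratifiedKFold(n_splits=4)

    # return_train_score=True also scores each model on the folds it was fit on.
    cv_results = cross_validate(xgb_clf, X_train, y, cv=skf,
                                scoring='roc_auc', return_train_score=True)
    print("Train ROC AUC:      %.2f%%" % (cv_results['train_score'].mean() * 100))
    print("Validation ROC AUC: %.2f%%" % (cv_results['test_score'].mean() * 100))

Also note that your CV uses `scoring='roc_auc'`, so the 88% figure is ROC AUC; make sure the 0.75 leaderboard score is computed with the same metric before comparing the two numbers directly.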

fujiu