I'm building a decision tree model based on data from the "Give me some credit" Kaggle competition (https://www.kaggle.com/competitions/GiveMeSomeCredit/overview). I'm trying to train this model on the training dataset from the competition and then apply to it to my own dataset for research.
The problem I'm facing is that it looks like the f1 score my model gets and the results presented by the confusion matrix do not correlate, and the higher the f1 score is, the worse label prediction becomes. Currently my best parameters for maximizing f1 are the following (the way I measure the score is included):
from sklearn.model_selection import RandomizedSearchCV
import xgboost
classifier=xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree', importance_type='gain')
params={
"colsample_bytree":[0.3],
"gamma":[0.3],
"learning_rate":[0.1],
"max_delta_step":[1],
"max_depth":[4],
"min_child_weight":[9],
"n_estimators":[150],
"num_parallel_tree":[1],
"random_state":[0],
"reg_alpha":[0],
"reg_lambda":[0],
"scale_pos_weight":[4],
"validate_parameters":[1],
"n_jobs":[-1],
"subsample":[1],
}
clf=RandomizedSearchCV(classifier,param_distributions=params,n_iter=100,scoring='f1',cv=10,verbose=3)
clf.fit(X,y)
These parameters give me an f1 score of ≈0.46.
However, when this model is output onto a confusion matrix, the label prediction accuracy for label "1" is only 50% (Picture below).
When attempting to tune the parameters in order to achieve better label prediction, I can improve the label prediction accuracy to 97% for both labels, however that decreases the f1 score to about 0.3. Here's the code I use for creating the confusion matrix (parameters included are the ones that have the f1 score of 0.3):
from xgboost import XGBClassifier
from numpy import nan
final_model = XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0.2, gpu_id=0, grow_policy='depthwise',
importance_type='gain', interaction_constraints='',
learning_rate=1.5, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=9,
missing=nan, monotone_constraints='()', n_estimators=800,
n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=5)
final_model.fit(X,y)
pred_xgboost = final_model.predict(X)
cm = confusion_matrix(y, pred_xgboost)
cm_norm = cm/cm.sum(axis=1)[:, np.newaxis]
plt.figure()
fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(cm_norm, classes=rf.classes_)
And here's the confusion matrix for these parameters:
I don't understand why there is seemingly no correlation between these two metrics (f1 score and confusion matrix accuracy), perhaps a different scoring system would prove more useful?