Consider that I have two classes of data and I am using sklearn for classification:
from sklearn.model_selection import StratifiedKFold, cross_validate

def cv_classif_wrapper(classifier, X, y, n_splits=5, random_state=42, verbose=0):
    '''Cross-validation wrapper: stratified k-fold scores for a classifier.'''
    # Stratified folds preserve the class ratio in each split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                         random_state=random_state)
    scores = cross_validate(classifier, X, y, cv=cv, scoring=[
        'f1_weighted', 'accuracy', 'recall_weighted', 'precision_weighted'])
    if verbose:
        print("=====================")
        print(f"Accuracy: {scores['test_accuracy'].mean():.3f} (+/- {scores['test_accuracy'].std()*2:.3f})")
        print(f"Recall: {scores['test_recall_weighted'].mean():.3f} (+/- {scores['test_recall_weighted'].std()*2:.3f})")
        print(f"Precision: {scores['test_precision_weighted'].mean():.3f} (+/- {scores['test_precision_weighted'].std()*2:.3f})")
        print(f"F1: {scores['test_f1_weighted'].mean():.3f} (+/- {scores['test_f1_weighted'].std()*2:.3f})")
    return scores
and I call it with:
scores = cv_classif_wrapper(LogisticRegression(), Xs, y0, n_splits=5, verbose=1)
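For context on these numbers given the class imbalance mentioned below (35 vs. 364 subjects), one could compare against a majority-class baseline; a minimal sketch using sklearn's DummyClassifier with the same wrapper (Xs and y0 as above):

from sklearn.dummy import DummyClassifier

# A classifier that always predicts the majority class already reaches
# about 364/399 ≈ 0.91 accuracy on this data, a useful reference point.
baseline_scores = cv_classif_wrapper(
    DummyClassifier(strategy="most_frequent"), Xs, y0, n_splits=5, verbose=1)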
Then I calculate the confusion matrix with this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

model = LogisticRegression(random_state=42)
y_pred = cross_val_predict(model, Xs, y0, cv=5)
cm = confusion_matrix(y0, y_pred)
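As a cross-check on whether these predictions and the scores agree, the per-class and weighted metrics can be recomputed from the same y_pred; a minimal sketch using sklearn.metrics.classification_report:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 plus weighted averages, computed from the
# same cross-validated predictions that produced the confusion matrix.
print(classification_report(y0, y_pred, digits=3))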
The question is: I am getting 0.95 for the F1 score, but looking at the confusion matrix, is this consistent with an F1 score of 0.95? Where is the mistake, if there is one?
Note that there are 35 subjects in class 0 and 364 in class 1. For reference, the printed scores are:
Accuracy: 0.952 (+/- 0.051)
Recall: 0.952 (+/- 0.051)
Precision: 0.948 (+/- 0.062)
F1: 0.947 (+/- 0.059)
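One thing worth ruling out: cross_validate above scores on a shuffled StratifiedKFold with random_state=42, while cross_val_predict is called with cv=5, which builds its own unshuffled splitter, so the confusion matrix and the printed scores do not come from identical folds. A sketch that reuses one splitter for both (same Xs and y0 assumed):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix

# Pass the same splitter used by cv_classif_wrapper so the matrix and the
# cross-validation scores are computed on exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(LogisticRegression(random_state=42), Xs, y0, cv=cv)
print(confusion_matrix(y0, y_pred))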