I am doing multi-class classification, with unbalanced categories.
I noticed f1 is always smaller than the direct harmonic mean of precision and recall, and in some cases, f1 is even smaller than both precision and recall.
FYI, I called metrics.precision_score(y,pred)
for precision and so on.
I am aware of the difference of micro/macro average, and tested that they are not micro by using the category results from precision_recall_fscore_support()
.
Not sure is this due to macro-average is used or some other reasons?
Updated detailed results as below:
n_samples: 75, n_features: 250
MultinomialNB(alpha=0.01, fit_prior=True)
2-fold CV:
1st run:
F1: 0.706029106029
Precision: 0.731531531532
Recall: 0.702702702703
precision recall f1-score support
0 0.44 0.67 0.53 6
1 0.80 0.50 0.62 8
2 0.78 0.78 0.78 23
avg / total 0.73 0.70 0.71 37
2nd run:
F1: 0.787944219523
Precision: 0.841165413534
Recall: 0.815789473684
precision recall f1-score support
0 1.00 0.29 0.44 7
1 0.75 0.86 0.80 7
2 0.82 0.96 0.88 24
avg / total 0.84 0.82 0.79 38
Overall:
Overall f1-score: 0.74699 (+/- 0.02)
Overall precision: 0.78635 (+/- 0.03)
Overall recall: 0.75925 (+/- 0.03)
Definitions about micro/macro-averaging from Scholarpedia:
In multi-label classification, the simplest method for computing an aggregate score across categories is to average the scores of all binary task. The resulted scores are called macro-averaged recall, precision, F1, etc. Another way of averaging is to sum over TP, FP, TN, FN and N over all the categories first, and then compute each of the above metrics. The resulted scores are called micro-averaged. Macro-averaging gives an equal weight to each category, and is often dominated by the system’s performance on rare categories (the majority) in a power-law like distribution. Micro-averaging gives an equal weight to each document, and is often dominated by the system’s performance on most common categories.
It is a current open issue in Github, #83.
Following example demonstrates how Micro, Macro and weighted(current in Scikit-learn) averaging may differ:
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]
Confusion matrix:
[[9 3 0]
[3 5 1]
[1 1 4]]
Wei Pre: 0.670655270655
Wei Rec: 0.666666666667
Wei F1 : 0.666801346801
Wei F5 : 0.668625356125
Mic Pre: 0.666666666667
Mic Rec: 0.666666666667
Mic F1 : 0.666666666667
Mic F5 : 0.666666666667
Mac Pre: 0.682621082621
Mac Rec: 0.657407407407
Mac F1 : 0.669777037588
Mac F5 : 0.677424801371
F5 above is a shorthand for F0.5...