
I am doing multi-class classification, with unbalanced categories.

I noticed that the f1 score is always smaller than the direct harmonic mean of precision and recall, and in some cases f1 is even smaller than both precision and recall.

FYI, I called metrics.precision_score(y,pred) for precision and so on.

I am aware of the difference between micro- and macro-averaging, and verified that the scores are not micro-averaged by checking the per-category results from precision_recall_fscore_support().

I am not sure whether this is because macro-averaging is used, or due to some other reason.
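
For reference, this is a minimal sketch of the check I describe above (y and pred stand for the true and predicted label arrays; recent scikit-learn versions require an explicit average= argument for multi-class input):

from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support

# Aggregate scores, as reported above
p = metrics.precision_score(y, pred)
r = metrics.recall_score(y, pred)
f1 = metrics.f1_score(y, pred)

# Per-category scores, to check which kind of averaging is applied
prec_c, rec_c, f1_c, support = precision_recall_fscore_support(y, pred)
print(prec_c, rec_c, f1_c, support)

# Direct harmonic mean of the aggregate precision and recall
print(2 * p * r / (p + r))   # consistently larger than f1 above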


Updated with detailed results below:

n_samples: 75, n_features: 250

MultinomialNB(alpha=0.01, fit_prior=True)

2-fold CV:

1st run:

F1:        0.706029106029
Precision: 0.731531531532
Recall:    0.702702702703

         precision    recall  f1-score   support

      0       0.44      0.67      0.53         6
      1       0.80      0.50      0.62         8
      2       0.78      0.78      0.78        23

avg / total       0.73      0.70      0.71        37

2nd run:

F1:        0.787944219523
Precision: 0.841165413534
Recall:    0.815789473684

         precision    recall  f1-score   support

      0       1.00      0.29      0.44         7
      1       0.75      0.86      0.80         7
      2       0.82      0.96      0.88        24

avg / total       0.84      0.82      0.79        38

Overall:

Overall f1-score:   0.74699 (+/- 0.02)
Overall precision:  0.78635 (+/- 0.03)
Overall recall:     0.75925 (+/- 0.03)
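
Assuming the avg / total row is a support-weighted average of the per-class scores (that is my guess at this point), the 2nd run can be reproduced by hand, which also shows how the averaged F1 can end up below both the averaged precision and recall:

import numpy as np

# Per-class scores and supports from the 2nd run above
precision = np.array([1.00, 0.75, 0.82])
recall    = np.array([0.29, 0.86, 0.96])
f1        = np.array([0.44, 0.80, 0.88])
support   = np.array([7.0, 7.0, 24.0])

w = support / support.sum()
print(np.dot(w, precision))   # ~0.84
print(np.dot(w, recall))      # ~0.82
print(np.dot(w, f1))          # ~0.78-0.79 with these rounded inputs; below both numbers above

Each per-class F1 is the harmonic mean of that class's own precision and recall, but the weighted average of those F1 values need not equal the harmonic mean of the weighted precision and weighted recall, and can fall below both.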

Definitions of micro/macro-averaging from Scholarpedia:

In multi-label classification, the simplest method for computing an aggregate score across categories is to average the scores of all binary tasks. The resulting scores are called macro-averaged recall, precision, F1, etc. Another way of averaging is to first sum TP, FP, TN, FN and N over all the categories, and then compute each of the above metrics. The resulting scores are called micro-averaged. Macro-averaging gives equal weight to each category, and is often dominated by the system's performance on rare categories (the majority) in a power-law-like distribution. Micro-averaging gives equal weight to each document, and is often dominated by the system's performance on the most common categories.
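
A small sketch of these two procedures, using made-up per-class counts (tp, fp, fn below are hypothetical):

# Hypothetical true-positive / false-positive / false-negative counts per class
tp = [9.0, 5.0, 4.0]
fp = [4.0, 4.0, 1.0]
fn = [3.0, 4.0, 2.0]

# Macro: compute the metrics per class first, then average over classes
p_c = [t / (t + f) for t, f in zip(tp, fp)]
r_c = [t / (t + f) for t, f in zip(tp, fn)]
f_c = [2 * p * r / (p + r) for p, r in zip(p_c, r_c)]
macro_p, macro_r, macro_f1 = (sum(x) / len(x) for x in (p_c, r_c, f_c))

# Micro: pool the counts over all classes first, then compute the metrics once
micro_p = sum(tp) / (sum(tp) + sum(fp))
micro_r = sum(tp) / (sum(tp) + sum(fn))
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(macro_p, macro_r, macro_f1)
print(micro_p, micro_r, micro_f1)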


This is currently an open issue on GitHub: #83.


The following example demonstrates how micro, macro and weighted (currently used in scikit-learn) averaging may differ:

y    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

Confusion matrix:

[[9 3 0]
 [3 5 1]
 [1 1 4]]

Wei Pre: 0.670655270655
Wei Rec: 0.666666666667
Wei F1 : 0.666801346801
Wei F5 : 0.668625356125

Mic Pre: 0.666666666667
Mic Rec: 0.666666666667
Mic F1 : 0.666666666667
Mic F5 : 0.666666666667

Mac Pre: 0.682621082621
Mac Rec: 0.657407407407
Mac F1 : 0.669777037588
Mac F5 : 0.677424801371

F5 above is shorthand for F0.5.
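
In recent scikit-learn versions the averaging strategy can be selected explicitly through the average parameter, so the numbers above should be reproducible roughly like this (a sketch; the macro F-scores may differ slightly because scikit-learn averages the per-class F-scores rather than taking the harmonic mean of the macro precision and recall):

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

for avg in ('weighted', 'micro', 'macro'):
    print(avg,
          precision_score(y, pred, average=avg),
          recall_score(y, pred, average=avg),
          f1_score(y, pred, average=avg),
          fbeta_score(y, pred, beta=0.5, average=avg))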

  • If we are macro-averaging, it means that precision, recall, and F1-score are computed for each class, and then the averages of all the precision values, recall values, and F1-score values are returned. So the harmonic mean of the final precision and the final recall is definitely not going to equal the final F1-score – Antoine Oct 19 '16 at 08:09

1 Answer


Can you please update your question with the output of:

>>> from sklearn.metrics import classification_report
>>> print classification_report(y_true, y_predicted)

That will display the precision and recall for each individual category, along with the support, and hence help us make sense of how the averaging works and decide whether this is appropriate behavior or not.

ogrisel
  • I checked the results. It seems neither micro- nor macro-averaging is used. The strange behavior (F1 smaller than both precision and recall) occurs in the 2nd run, and I just realized it is also partially caused by the nature of the harmonic mean: Harmonic(1.00, 0.29) = 0.44 goes against my direct intuition but is true. However, the non-micro/macro averaging method may also be another cause. – Flake Nov 28 '11 at 21:56
  • 1
    The actual scikit-learn implementation is a weighted average accross classes where the weights are the support (number of samples in each class). So to me it sounds like micro-averaging but I have not worked out the details so it might not be equivalent at all. If you would like to contribute a real implementation of the micro-averaging using TP, FP, TN, FN averaged across classes, please feel free to send a pull-request. – ogrisel Nov 28 '11 at 22:17
  • I will look into the code more carefully and figure it out. :) I am very new to Python, so we'll see if that can happen in the future. Anyhow, I am really interested in and highly appreciate your work on scikit-learn. – Flake Nov 28 '11 at 22:23
  • Found this issue tracker: https://github.com/scikit-learn/scikit-learn/issues/83. I tried an example; the current calculation is indeed neither macro nor micro for n > 2. Tricky behaviors it brings about include f1 < both precision and recall. – Flake Nov 28 '11 at 23:09
  • Indeed, I forgot about this issue. My memory does not last for more than a couple of months anymore. Blame Twitter-induced attention disorder, I guess... So please feel free to step up and submit a pull request for micro-averaging (I don't think macro-averaging is that useful, but it's simpler to implement). – ogrisel Nov 29 '11 at 22:41
  • I sent an email to the list about this too. It seems it is pending due to file size; I accidentally pasted SE code into the email, which turned out to be pictures... I am new to Python, still not at the level to contribute quality code. But I experimented a little and found another strange thing: F1, recall and precision all seem to be the same under micro-averaging for multi-class classification. – Flake Nov 30 '11 at 00:03
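
A quick check of that last observation, using the example arrays from the question: in single-label multi-class problems every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts make micro-averaged precision, recall and F1 all equal to accuracy.

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2, 2]

print(precision_score(y, pred, average='micro'))  # 0.666...
print(recall_score(y, pred, average='micro'))     # 0.666...
print(f1_score(y, pred, average='micro'))         # 0.666...
print(accuracy_score(y, pred))                    # 0.666...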