There is just one F-1 score - the harmonic mean of precision and recall.
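To make that definition concrete, here is a minimal plain-Python sketch that computes F1 from raw counts. The function name `f1_from_counts` and the choice to return 0.0 when a denominator is zero are my own assumptions for illustration, not taken from any library.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (hypothetical helper)."""
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both are 0
    return 2 * precision * recall / (precision + recall)


# 1 true positive, 1 false positive, 3 false negatives:
# precision = 0.5, recall = 0.25
print(f1_from_counts(1, 1, 3))
```

Everything listed below is just a different way of combining this one quantity across classes or samples.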
Macro, micro, samples, weighted and binary are the possible values of the average parameter of scikit-learn's f1_score, which is needed for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
binary
: Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
micro
: Calculate metrics globally by counting the total true positives, false negatives and false positives.
macro
: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
weighted
: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
samples
: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification, where this differs from accuracy_score).
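The difference between the None, macro, weighted and micro options can be sketched in plain Python as below. This is only an illustration under my own assumptions: it handles single-label multiclass targets, the helper names (`per_class_counts`, `f1`, `f1_scores`) are hypothetical, and it is not scikit-learn's actual implementation. The binary and samples options are left out.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per label."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[t] += 1  # true label t was missed
    return labels, tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    labels, tp, fp, fn = per_class_counts(y_true, y_pred)
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true instances per label
    return {
        "none": per_class,  # one score per class, no averaging
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
    }


y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
for name, value in f1_scores(y_true, y_pred).items():
    print(name, value)
```

In this single-label setting the micro value equals plain accuracy (5 of 7 correct here), while macro treats the rare class 2 the same as the frequent class 0.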
The SegNet paper reports the accuracy of each class separately in Table 5, so I think they chose None in this case.