
I am trying to predict a set of labels using Logistic Regression from scikit-learn. My data is heavily imbalanced (there are many more '0' than '1' labels), so I have to use the F1 score metric during the cross-validation step to "balance" the result.

[Input]
X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2', 
    fit_intercept=True,
    scoring='f1'
)
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f' % f1_score(y_test, logistic.predict(X_test)))
print('Accuracy score: %f' % logistic.score(X_test, y_test))

[Output]
>> Predicted: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
>> Actual:    [0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1]
>> F1-score: 0.285714
>> Accuracy score: 0.782609
>> C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:958:  
   UndefinedMetricWarning:
   F-score is ill-defined and being set to 0.0 due to no predicted samples.

I certainly know that the problem is related to my dataset: it is too small (it is only a sample of the real one). However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?

David
    On a side note, if your dataset is REALLY imbalanced (say 100000 of '0' and just 20 '1') you may want to go away from classification task to anomaly detection approach. For extremely skewed cases it will work much better. Details: http://scikit-learn.org/stable/modules/outlier_detection.html – Maksim Khaitovich Jul 28 '15 at 15:24
  • 2
    The imbalance here is 70-30% approximately so I think it is still suitable to use classic classifiers. However, your comment might be extremely valuable for people struggling with really skewed datasets so thank you for the hint anyway :) – David Jul 28 '15 at 19:26

2 Answers


This seems to be a known bug, which has since been fixed; try updating scikit-learn.

Geeocode
    I have this error message with scikit-learn 0.17. Any updates on this matter? My classes are almost balanced. – OAK Feb 05 '16 at 01:30

However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?

This is well-described at https://stackoverflow.com/a/34758800/1587329:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py

F1 = 2 * (precision * recall) / (precision + recall)

precision = TP / (TP + FP): as you said, if the predictor doesn't predict the positive class at all, both TP and FP are 0, so precision is 0/0.

recall = TP / (TP + FN): if the predictor doesn't predict the positive class, TP is 0, so recall is 0.

So the F1 formula ends up dividing 0 by 0, which is undefined; scikit-learn sets the score to 0.0 and emits the UndefinedMetricWarning.
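To see concretely where the 0/0 comes from, here is a small sketch (the labels are made up for illustration) that counts TP/FP/FN for a classifier that only ever predicts '0':

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
y_pred = np.zeros_like(y_true)               # classifier never predicts '1'

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives:  0
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives: 0
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives: 3

# precision = tp / (tp + fp) is 0/0 -- exactly what triggers the warning
print(tp, fp, fn)
```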

To fix the weighting problem (it's easy for the classifier to (almost) always predict the more prevalent class), you can use class_weight="balanced":

logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2', 
    fit_intercept=True,
    scoring='f1',
    class_weight="balanced"
)

LogisticRegressionCV says:

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
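As a quick sanity check of that formula, here is what it produces for a made-up label vector with roughly the 70/30 split from the question:

```python
import numpy as np

y = np.array([0] * 7 + [1] * 3)   # 70/30 imbalance, for illustration only
n_samples, n_classes = len(y), 2

# n_samples / (n_classes * np.bincount(y)), as in the docs quoted above
weights = n_samples / (n_classes * np.bincount(y))
print(weights)   # the rare class '1' gets the larger weight
```

So the minority class's errors count for more during fitting, which pushes the classifier away from always predicting the majority class.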

serv-inc