
I'm working on a binary classification model with a naive Bayes classifier. I have an almost balanced dataset, yet I get the following warning when I predict:

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

I'm using grid search with 10-fold CV. The test set and the predictions contain both classes, so I don't understand the message. I'm using the same dataset, train/test split, CV and random seed for 6 other models, and those work perfectly. The data is ingested externally into a dataframe, randomized with a fixed seed; the naive Bayes classification model class then reads the file, before this code snippet runs.

from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

X_train, X_test, y_train, y_test, len_train, len_test = \
    train_test_split(data['X'], data['y'], data['len'], test_size=0.4)

pipeline = Pipeline([
    ('classifier', MultinomialNB())
])

# 0.17-era API: the first argument is the array to stratify on
cv = StratifiedKFold(len_train, n_folds=10)

# scikit-learn expects a 2-D (n_samples, n_features) array for a single feature
len_train = len_train.reshape(-1, 1)
len_test = len_test.reshape(-1, 1)

params = [
    {'classifier__alpha': [0, 0.0001, 0.001, 0.01]}
]

grid = GridSearchCV(
    pipeline,
    param_grid=params,
    refit=True,
    n_jobs=-1,
    scoring='accuracy',
    cv=cv,
)

nb_fit = grid.fit(len_train, y_train)
preds = nb_fit.predict(len_test)

print(confusion_matrix(y_test, preds, labels=['1', '0']))
print(classification_report(y_test, preds))

I was 'forced' by Python to alter the shape of the series; maybe that is the culprit?
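For reference, the reshape just turns the 1-D series into the 2-D column that scikit-learn estimators expect; a minimal sketch with made-up lengths:

import numpy as np

# a hypothetical 1-D array of lengths, shape (n_samples,)
lengths = np.array([120, 85, 300, 42])

# estimators want shape (n_samples, n_features), so the single
# feature becomes a column vector of shape (4, 1)
X = lengths.reshape(-1, 1)
print(X.shape)  # (4, 1)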

OAK
  • What version of scikit-learn are you using? @OAK – Farseer Feb 05 '16 at 21:31
  • @Farseer version 0.17. I read there was a bug in a previous version; not sure if there is one in this version too. – OAK Feb 05 '16 at 22:33
  • This warning means that precision, and consequently the F1 score, are undefined for labels whose tp + fp is zero, which results in 0 / 0 when calculating precision for that label. Because the F1 score is a function of precision, it is also undefined, and both are set to 0.0 by the library. – aadel Sep 25 '16 at 19:27
  • @OAK if the answer below satisfies, could you please mark it as accepted? Otherwise, let me know what is unclear. Thanks. – Ori Feb 23 '21 at 12:07

2 Answers


The meaning of the warning

As the other answers here suggest, you have encountered a situation where the precision and F-score can't be computed due to their definition (a denominator of 0 in precision or recall). In such cases, the metric's value is set to 0.

The test data contains all labels, so why does this still happen?

Well, you are using K-Fold (specifically, in your case, k = 10), which means that one specific split might contain zero samples of one class, as the sketch below demonstrates.
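For illustration, here is how a plain (unstratified) K-Fold can produce such a split when the labels are ordered; this sketch uses the current KFold API (n_splits) rather than the 0.17-era n_folds from the question:

import numpy as np
from sklearn.model_selection import KFold

# labels sorted by class: without shuffling, each test fold is a
# contiguous slice and here contains only a single class
y = np.array([0] * 5 + [1] * 5)

for train_idx, test_idx in KFold(n_splits=2, shuffle=False).split(y):
    print("train classes:", np.unique(y[train_idx]),
          "test classes:", np.unique(y[test_idx]))
# train classes: [1] test classes: [0]
# train classes: [0] test classes: [1]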

Still happens, even when using Stratified K-Fold

This is a little tricky. Stratified K-Fold ensures the same proportion of each class in each split. However, the warning does not depend only on the true classes. Precision is computed as TP / predicted positives, i.e. TP / (TP + FP). If for some reason your model predicts every sample as negative, then predicted positives = 0, which makes precision undefined (and, in turn, can make the F-score undefined); see the sketch below.
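A minimal reproduction of that degenerate case (hypothetical labels, not the asker's data):

from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # degenerate model: predicts the negative class for everything

# precision for class 1 is TP / (TP + FP) = 0 / 0, so scikit-learn emits
# UndefinedMetricWarning and reports 0.0 for that precision and F1
print(classification_report(y_true, y_pred))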

This sounds like an edge case, but consider that in a grid search you are probably evaluating a whole lot of different parameter combinations, some of which may be totally off and produce exactly this scenario.

I hope this answers your question!

Ori

As aadel has commented, when no data points are classified as positive, precision divides by zero, as it is defined as TP / (TP + FP) (i.e., true positives over all predicted positives). The library then sets precision to 0 but issues a warning, because the value is actually undefined. F1 depends on precision and hence is not defined either.

Once you are aware of this, you can choose to disable the warning with:

import warnings
import sklearn.exceptions

# suppress only this specific warning; the undefined precision/F-score
# entries are still reported as 0.0
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
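Alternatively, newer scikit-learn versions (well after the 0.17 used in the question) let you set the value returned for the undefined case explicitly, instead of silencing the warning, via the zero_division parameter:

from sklearn.metrics import classification_report, precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]

# report 0.0 for undefined precision/F1 without emitting a warning
print(classification_report(y_true, y_pred, zero_division=0))
print(precision_score(y_true, y_pred, zero_division=0))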