
I was able to use the following method to do cross validation on binary data, but it does not seem to work for multiclass data:

> cross_validation.cross_val_score(alg, X, y, cv=cv_folds, scoring='roc_auc')

/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
    169         y_type = type_of_target(y)
    170         if y_type not in ("binary", "multilabel-indicator"):
--> 171             raise ValueError("{0} format is not supported".format(y_type))
    172 
    173         if is_regressor(clf):

ValueError: multiclass format is not supported

> y.head()

0    10
1     6
2    12
3     6
4    10
Name: rank, dtype: int64

> type(y)

pandas.core.series.Series

I also tried changing roc_auc to f1 but still get an error:

/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight)
   1016         else:
   1017             raise ValueError("Target is %s but average='binary'. Please "
-> 1018                              "choose another average setting." % y_type)
   1019     elif pos_label not in (None, 1):
   1020         warnings.warn("Note that pos_label (set to %r) is ignored when "

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

Is there any method I can use to do cross validation for this type of data?

Deqing
  • ROC is only appropriate for binary classifiers. You should consider another scoring function or compute your ROC with a One vs Rest method. – sjakw Jul 31 '17 at 12:05
  • Check the `average` parameter [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) and use the appropriate one. – Vivek Kumar Jul 31 '17 at 13:28
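Following sjakw's one-vs-rest suggestion, here is a minimal sketch of multiclass ROC AUC. Note this relies on `multi_class='ovr'` in `roc_auc_score`, which is only available in scikit-learn 0.22 and later (newer than the version shown in the traceback); the iris dataset and logistic-regression classifier are stand-ins for your own `alg`, `X`, and `y`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Collect out-of-fold class probabilities, then score one-vs-rest:
proba = cross_val_predict(clf, X, y, cv=3, method='predict_proba')
auc = roc_auc_score(y, proba, multi_class='ovr', average='macro')
```

On recent versions you can also pass `scoring='roc_auc_ovr'` directly to `cross_val_score`.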

1 Answer


As pointed out in the comment by Vivek Kumar, sklearn metrics support multi-class averaging for both the F1 score and the ROC computations, albeit with some limitations when the data is unbalanced. So you can manually construct the scorer with the corresponding `average` parameter, or use one of the predefined ones (e.g. 'f1_micro', 'f1_macro', 'f1_weighted').
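A minimal sketch of both options, using the iris dataset and a logistic-regression classifier as stand-ins for your `alg`, `X`, and `y`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Option 1: a predefined multiclass-safe scorer string
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

# Option 2: build the same scorer manually with make_scorer
macro_f1 = make_scorer(f1_score, average='macro')
scores2 = cross_val_score(clf, X, y, cv=5, scoring=macro_f1)
```

Both return one F1 score per fold; the manual version is useful when you need to pass extra keyword arguments to the metric.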

If multiple scores are needed, use cross_validate instead of cross_val_score (available since sklearn 0.19 in the module sklearn.model_selection).
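For example, passing a list of scorer names to `cross_validate` (again with iris and logistic regression as placeholders for your own data and estimator):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Evaluate several averaged F1 variants in one pass over the folds
results = cross_validate(clf, X, y, cv=3,
                         scoring=['f1_micro', 'f1_macro', 'f1_weighted'])
# results is a dict with one array per metric, e.g. results['test_f1_macro']
```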

rll
  • What if I don't want to average the scoring result? I want the cross-validated scoring result for each class (f1score, precision, recall). This does not seem to be possible. With cross_validate in combination with the f1_score and average=None I get `ValueError: scoring must return a number`. Is there a possibility to do this? – LNA Mar 02 '21 at 14:17
  • Hi @LNA, you should check the documentation for `cross_val_score` and `cross_validate`; they actually return each fold's score. Is that what you mean? If you actually want it by class, [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) should have what you need. – rll Mar 04 '21 at 15:19
  • No, I don't think we are talking about the same thing. If I have 5 classes and 3 folds, cross_validate provides the f1_score for each fold, so I will get 3 numbers, but I don't really care about the results of the individual folds. I want the f1score results for each of the classes: either 5 numbers (averaged over the 3 folds) for each class, OR 3x5 numbers for all folds and all classes, and I will do the averaging on my own. It seemed this should be possible by setting average=None, but cross_validate throws an error. It seems it can only handle single numbers. – LNA Mar 08 '21 at 10:13
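Since scorers passed to `cross_validate` must return a single number, one workaround for per-class scores (a sketch, not from the thread, with iris and logistic regression as placeholders) is to collect out-of-fold predictions with `cross_val_predict` and then call the metric with `average=None`:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold predictions for every sample
y_pred = cross_val_predict(clf, X, y, cv=3)

# average=None yields one F1 score per class
per_class_f1 = f1_score(y, y_pred, average=None)
```

Caveat: this pools predictions across folds and computes each class's score once on the pooled predictions, rather than averaging per-fold scores, so the numbers can differ slightly from a fold-wise average.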