How to make a multiclass cross-validated ROC curve in sklearn?

Question

I have the following code that outputs the ROC curve of every iteration of a stratified cross-validation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
import scikitplot as skplt
import numpy as np

lr_model3 = LogisticRegression(max_iter=10000, penalty='l2')

y_tests = []
y_probabilities = []

print(X.shape)
print(y.shape)


cv = StratifiedKFold(n_splits=10)
for train_index, test_index in cv.split(X,y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lr_model3.fit(X_train,y_train)
    y_probas = lr_model3.predict_proba(X_test)
    y_probabilities.append(y_probas)
    y_tests.append(y_test)


for i in range(10):
    skplt.metrics.plot_roc(y_tests[i], y_probabilities[i], title = 'Iteration {} ROC Curve'.format(i+1))

It outputs one ROC plot per iteration, from the 1st until the 10th.

However, what I want is to display just one ROC curve that summarizes the 10 ROC curves. Is that possible? The code above is my attempt, but I am open to other solutions too.

amiola · Answer 1 · 2022-02-01T13:35:36.983

In my opinion, the issue is that you're building y_tests and y_probabilities as lists. If you look at the skplt.metrics.plot_roc implementation, you can see that both arguments are required to be array-like:

y_true : (array-like, shape (n_samples)): Ground truth (correct) target values.

y_probas : (array-like, shape (n_samples, n_classes)): Prediction probabilities for each class returned by a classifier.

To my knowledge, within sklearn (I don't know the convention of scikit-plot), whenever the documentation says an argument is array-like, you can pass a list in place of a NumPy array (e.g. if "array-like of str" is specified, you can use a list of strings). When an argument is specifically required to be a NumPy array, the documentation is more explicit and gives the type as ndarray (see sklearn's roc_curve, for instance).
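As a quick check of the array-like convention, here is a minimal sketch with made-up labels and scores (the values are illustrative only, not from the question). It relies on scikit-learn converting flat array-like inputs internally, so a flat list and a NumPy array should give the same curve; a nested, per-fold list is a different matter.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy binary labels and scores (made-up values for illustration only).
y_true_list = [0, 0, 1, 1]
y_score_list = [0.1, 0.4, 0.35, 0.8]

# A flat Python list is array-like and is accepted as is.
fpr_list, tpr_list, _ = roc_curve(y_true_list, y_score_list)

# The same data as NumPy arrays.
fpr_arr, tpr_arr, _ = roc_curve(np.array(y_true_list), np.array(y_score_list))

# Both calls describe the same ROC curve.
same_curve = np.array_equal(fpr_list, fpr_arr) and np.array_equal(tpr_list, tpr_arr)
print(same_curve)
```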

Given that:

LogisticRegression handles the multiclass case natively in your setting (see the multi_class parameter and the User Guide)

multi_class: {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’ If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

both y_true and y_probas are converted into NumPy arrays inside plot_roc

.roc_curve() from sklearn is then called on them

you end up with y_tests and y_probabilities as lists of NumPy arrays, one per fold. After the conversion in the second premise above, they become nested arrays with an extra per-fold dimension, which .roc_curve() from sklearn does not accept:

y_true: ndarray of shape (n_samples,)

y_score: ndarray of shape (n_samples,)

Here's the y_tests I got from the iris dataset via your code before passing it to skplt.metrics.plot_roc.

[array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]), 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])]
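The shape problem can be seen without fitting anything. The sketch below uses stand-in data (random probabilities, not real model output) with the same layout as above: 10 folds, 15 test samples each, 3 classes. Converting the per-fold lists with np.array adds a fold axis, while np.concatenate gives the (n_samples,) and (n_samples, n_classes) shapes that the docstrings ask for.

```python
import numpy as np

# Stand-in for the per-fold outputs: 10 folds, 15 test samples, 3 classes.
# The probabilities are random placeholders, not real predictions.
rng = np.random.default_rng(0)
y_tests = [np.repeat([0, 1, 2], 5) for _ in range(10)]
y_probabilities = [rng.dirichlet(np.ones(3), size=15) for _ in range(10)]

# Converting the lists directly stacks the folds along a new first axis.
stacked_true = np.array(y_tests)
stacked_proba = np.array(y_probabilities)
print(stacked_true.shape, stacked_proba.shape)

# Concatenating along the sample axis pools all folds into flat arrays.
flat_true = np.concatenate(y_tests)
flat_proba = np.concatenate(y_probabilities)
print(flat_true.shape, flat_proba.shape)
```

Note that the stacking only works this cleanly because every fold here has the same size; with unequal folds, concatenation is the safer route.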

Instead, I was able to solve your issue on both the iris and digits datasets by forcing them to be NumPy arrays, as follows:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import scikitplot as skplt

# Load iris as plain NumPy arrays (no pandas needed here).
iris = load_iris()
X = iris.data
y = iris.target

lr_model3 = LogisticRegression(max_iter=200, penalty='l2')

cv = StratifiedKFold(n_splits=10)
for train_index, test_index in cv.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    lr_model3.fit(X_train,y_train)
    y_probas = lr_model3.predict_proba(X_test)
    # Note: these two names are reassigned on every iteration,
    # so after the loop they hold the last fold only.
    y_probabilities = np.array(y_probas)
    y_tests = np.array(y_test)

# Plots the ROC curves for the arrays left over from the final fold.
skplt.metrics.plot_roc(y_tests, y_probabilities)
plt.show()
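Since the loop above reassigns the arrays on each pass, the resulting plot reflects only the final fold. One possible way to get a single summary over all 10 folds is to pool the out-of-fold predictions before computing the curves. The following is a sketch of that idea, not the only way to do it; it uses sklearn's roc_curve and auc in a one-vs-rest fashion, and the names y_true_all, y_proba_all and pooled_auc are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Collect the held-out labels and probabilities of every fold.
y_tests, y_probabilities = [], []
for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    y_probabilities.append(model.predict_proba(X[test_index]))
    y_tests.append(y[test_index])

# Pool the folds along the sample axis: shapes (150,) and (150, 3).
y_true_all = np.concatenate(y_tests)
y_proba_all = np.concatenate(y_probabilities)

# One-vs-rest ROC curve and AUC per class on the pooled predictions.
pooled_auc = {}
for k, label in enumerate(model.classes_):
    is_label = (y_true_all == label).astype(int)
    fpr, tpr, _ = roc_curve(is_label, y_proba_all[:, k])
    pooled_auc[int(label)] = auc(fpr, tpr)

print(pooled_auc)
```

The pooled arrays have the shapes quoted from the plot_roc docstring, so passing them as skplt.metrics.plot_roc(y_true_all, y_proba_all) should give one figure for the whole cross-validation, assuming scikit-plot works in your environment.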

Finally, here are some hints from a similar sklearn example that extends the ROC curve to a multiclass setting: Plotting the ROC curve for a multiclass problem
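A different way to summarize the folds is to average the per-fold curves rather than pool the predictions. The sketch below interpolates each fold's one-vs-rest curve onto a common false-positive-rate grid and takes the mean per class. It is an illustration under my own choices (a 101-point grid, names such as mean_fpr and mean_tpr), not something taken from the linked example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
classes = [int(c) for c in np.unique(y)]
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Common FPR grid on which every fold's curve is evaluated.
mean_fpr = np.linspace(0.0, 1.0, 101)
fold_tprs = {label: [] for label in classes}

for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    proba = model.predict_proba(X[test_index])
    for k, label in enumerate(model.classes_):
        is_label = (y[test_index] == label).astype(int)
        fpr, tpr, _ = roc_curve(is_label, proba[:, k])
        # Interpolate this fold's TPR onto the shared grid.
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        interp_tpr[0] = 0.0  # force the curve to start at (0, 0)
        fold_tprs[int(label)].append(interp_tpr)

# Average over folds, one mean curve and one AUC per class.
mean_tpr, mean_auc = {}, {}
for label in classes:
    curve = np.mean(fold_tprs[label], axis=0)
    curve[-1] = 1.0  # force the curve to end at (1, 1)
    mean_tpr[label] = curve
    mean_auc[label] = auc(mean_fpr, curve)

print(mean_auc)
```

Each mean_tpr[label] can then be drawn against mean_fpr with matplotlib's plt.plot to get one averaged curve per class. With only 15 test samples per fold the individual curves are coarse, so the average is smoother than any single fold but should be read with that in mind.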
