In my opinion, the issue is that you're constructing y_tests and y_probabilities as lists. Looking into the skplt.metrics.plot_roc implementation, you can see that they are required to be array-like:
y_true : (array-like, shape (n_samples)):
Ground truth (correct) target values.
y_probas : (array-like, shape (n_samples, n_classes)):
Prediction probabilities for each class returned by a classifier.
To my knowledge, within sklearn (I don't know the convention of scikit-plot), whenever the documentation says an argument is array-like, you can pass a list in place of a np array (e.g. if array-like of str is specified, you can use a list of strings). Conversely, the docs are more explicit when an argument specifically has to be a np array: they specify the ndarray type (see sklearn's roc_curve, for instance).
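As a small illustration of the array-like convention, here is a sketch using accuracy_score, picked only because its arguments are documented as array-like; the labels are made up:

```python
from sklearn.metrics import accuracy_score

# y_true and y_pred are documented as array-like, so plain lists work
score = accuracy_score([0, 1, 1, 2], [0, 1, 0, 2])
print(score)  # 0.75 (3 of 4 labels match)
```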
Given that:
multi_class: {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.
- both y_true and y_probas are converted into np arrays inside the plot_roc implementation
- roc_curve() from sklearn is then called on them

you end up with y_tests and y_probabilities as lists of np arrays (and, after the conversion in the first point, nested arrays rather than flat ones), which are not accepted by roc_curve() from sklearn:
y_true: ndarray of shape (n_samples,)
y_score: ndarray of shape (n_samples,)
Here's the y_tests I got from the iris dataset via your code before passing it to skplt.metrics.plot_roc:
[array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])]
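To see why a list like this is a problem, here is a small sketch with made-up labels (fold_labels, stacked and flat are illustrative names, not from your code). Converting a list of equal-length per-fold arrays with np.array gives a 2-D array, whereas np.concatenate gives the flat (n_samples,) shape the docstrings above ask for:

```python
import numpy as np

# Hypothetical stand-in for y_tests: 10 folds of 15 labels each
fold_labels = [np.repeat([0, 1, 2], 5) for _ in range(10)]

stacked = np.array(fold_labels)       # what np.array does to a list of arrays
flat = np.concatenate(fold_labels)    # one flat vector with all the labels

print(stacked.shape)  # (10, 15)
print(flat.shape)     # (150,)
```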
I was able to solve your issue by forcing both to be np arrays, on both the iris and digits datasets, as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import scikitplot as skplt

# load_iris is imported directly, so it is called without a datasets. prefix
iris = load_iris()
X = iris.data
y = iris.target

lr_model3 = LogisticRegression(max_iter=200, penalty='l2')
cv = StratifiedKFold(n_splits=10)

for train_index, test_index in cv.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    lr_model3.fit(X_train, y_train)
    y_probas = lr_model3.predict_proba(X_test)
    # keep plain np arrays instead of appending to lists
    y_probabilities = np.array(y_probas)
    y_tests = np.array(y_test)

# at this point y_tests and y_probabilities hold the values from the last fold
skplt.metrics.plot_roc(y_tests, y_probabilities)
plt.show()
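If the goal is one ROC plot over all folds (my assumption about the intent, it is not stated in the question), one option is to collect the per-fold outputs in lists and concatenate them once after the loop. A minimal sketch using only numpy and sklearn; all_tests and all_probas are illustrative names, and the final roc_curve call is just a check for class 0:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
cv = StratifiedKFold(n_splits=10)

all_tests, all_probas = [], []
for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    all_tests.append(y[test_index])
    all_probas.append(model.predict_proba(X[test_index]))

# Concatenate (rather than np.array) so the shapes stay flat
y_tests = np.concatenate(all_tests)           # shape (150,)
y_probabilities = np.concatenate(all_probas)  # shape (150, 3)

# One-vs-rest ROC curve for class 0, to check that roc_curve accepts them
fpr, tpr, _ = roc_curve((y_tests == 0).astype(int), y_probabilities[:, 0])
```

These pooled arrays have the shapes that the plot_roc docstring describes, so they could be passed as skplt.metrics.plot_roc(y_tests, y_probabilities).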
Finally, here are some hints from a similar sklearn example that extends the ROC curve to a multiclass setting: Plotting the ROC curve for a multiclass problem