
I have written the following code to classify multiclass data.

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from itertools import cycle
import pandas as pd 

##########################################################################################
df = pd.read_csv('merged_Zero_Cor_cleaned.tsv',sep='\t')

X = df.drop(columns='class')
y = df['class']

y_bin = label_binarize(y, classes=[0, 1, 2, 3, 4])
n_classes = y_bin.shape[1]

clf = OneVsRestClassifier(QDA())
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
y_pred = cross_val_predict(clf, X, y, cv=10)

lw = 2

fpr = dict()
tpr = dict()
roc_auc = dict()

# Compute ROC curve and AUC for each class (one-vs-rest)
for i in range(n_classes):
    # Replace any NaN probabilities with 0 before computing the curve
    class_scores = pd.DataFrame(y_score[:, i]).fillna(0)
    fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], class_scores.values.ravel())
    roc_auc[i] = auc(fpr[i], tpr[i])

colors = cycle(['blue', 'red', 'green','black', 'brown'])

for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()

##########################################################################################

In the above code, to check the performance I am computing the predicted probability scores and the predicted labels in two separate calls:

y_score = cross_val_predict(clf, X, y, cv=10 ,method='predict_proba')
y_pred = cross_val_predict(clf, X, y, cv=10 )

This is computationally expensive. Is there a way to get both outputs from a single call?

Update

Alternatively, how can I interpret the predicted class from these probabilities?

      0         1         2         3              4
0      0.0  0.250000  0.250000  0.250000   2.500000e-01
1      0.0  0.000000  0.000000  1.000000   0.000000e+00
2      0.0  0.250000  0.250000  0.250000   2.500000e-01
3      0.0  0.000000  0.333333  0.333333   3.333333e-01
4      0.0  0.000000  0.000000  1.000000   0.000000e+00
5      0.0  0.000000  0.000000  1.000000   8.744693e-23
6      0.0  0.333333  0.333333  0.333333   9.255446e-105
  • From the docs: "For method=’predict_proba’, the columns correspond to the classes in sorted order." So then you should have an array of predicted probabilities, and the column with the highest probability should match the class predicted by `predict`, so you shouldn't need to also predict the class – G. Anderson Sep 09 '19 at 16:21
  • @G.Anderson thanks for your reply. I have updated my question; can you suggest how I can interpret the class from the probability scores, as per your suggestion? – jax Sep 09 '19 at 18:48
  • You can use the answers discussed in [this question](https://stackoverflow.com/questions/39256287/how-to-get-classes-labels-from-cross-val-predict-used-with-predict-proba-in-scik) and/or [this question](https://stackoverflow.com/questions/16858652/how-to-find-the-corresponding-class-in-clf-predict-proba) to get the class labels, and `np.argmax()` to get the index of the highest probability in each row – G. Anderson Sep 09 '19 at 19:46
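Following up on the comments above: because `cross_val_predict` with method='predict_proba' returns one column per class in sorted order, the predicted labels can be recovered from the probability matrix with `np.argmax`, so a single cross-validated call is enough. A minimal sketch (not part of the original post, and assuming the class labels are exactly the sorted values 0 to 4, as used in `label_binarize` above):

import numpy as np

# One cross-validated call that returns per-class probabilities
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')

# Columns correspond to the classes in sorted order (0, 1, 2, 3, 4 here),
# so the predicted label of each row is the class whose column holds the
# highest probability
classes = np.sort(np.unique(y))
y_pred = classes[np.argmax(y_score, axis=1)]

For tied rows, such as the uniform 0.25 rows in the table above, np.argmax simply picks the first of the tied columns.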
