
I have a pickle file that, when loaded, returns a trained RandomForest classifier. I want to plot the ROC curve for the classes, but from what I read online the classifier must be wrapped in scikit-learn's OneVsRestClassifier, and since I already have the trained model I cannot wrap it and fit the model again.

So I would like to know if there is a workaround to plot the ROC curve. From my trained model I have y_test and y_proba; I also have the x_test values (how y_proba was obtained is sketched below).

  • The shape of my y_proba is (6715, 5).

  • The shape of y_test is (6715, 5).
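For context, y_proba comes from calling predict_proba on the unpickled model, roughly like this (the file name here is a placeholder):

import pickle

# Placeholder file name; the actual pickle path will differ
with open('rf_model.pkl', 'rb') as f:
    clf = pickle.load(f)

# One probability column per class: shape (n_samples, n_classes)
y_proba = clf.predict_proba(x_test)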

This is the output of the code @dx2-66 suggested:

[two screenshots of the resulting ROC plots; the second shows the attempt after binarizing the labels before the train/test split]

Yana

  • May https://stackoverflow.com/questions/70278059/plotting-the-roc-curve-for-a-multiclass-problem/70279497#70279497 help? The `RandomForestClassifier` estimator deals natively with multiclass problems without needing to be wrapped in a `OneVsRestClassifier`. – amiola Aug 25 '22 at 08:49
  • Technically, your saved model already follows one-vs-rest. By the way, a ROC curve makes sense for binary classification; it is not easily interpretable in the multiclass case. – Erwan Aug 25 '22 at 11:11

1 Answer


I assume your y_test is a single column with the class id, and your y_proba has as many columns as there are classes (at least, that's what you'd usually get from predict_proba()).

How about this? It should yield OvR-style curves:

from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

classes = range(y_proba.shape[1])

# Binarize the labels and plot one one-vs-rest ROC curve per class
for i in classes:
    fpr, tpr, _ = roc_curve(label_binarize(y_test, classes=classes)[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, alpha=0.7)

plt.legend(classes)  # one legend entry per class
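If you also want each class's AUC in the legend, a variation along these lines should work (same classes, y_test, and y_proba as above):

from sklearn.metrics import auc

# Binarize once, then label each curve with its AUC
y_bin = label_binarize(y_test, classes=classes)
for i in classes:
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, alpha=0.7, label=f'class {i} (AUC = {auc(fpr, tpr):.2f})')

plt.legend()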

Update: a solution for non-contiguous class labels (e.g. [2, 3, 5, 6, 7]):

classes = sorted(y_test['label'].unique())

plt.plot([0, 1], linestyle='--')  # chance-level diagonal baseline

# Binarize against the labels actually present, then plot per class
for i in range(len(classes)):
    fpr, tpr, _ = roc_curve(label_binarize(y_test['label'], classes=classes)[:, i], y_proba.values[:, i])
    plt.plot(fpr, tpr, alpha=0.7)

plt.legend(['baseline'] + classes)  # Fixed the baseline legend
dx2-66
  • Does this work on the principle of one class vs. rest? I get a very strange plot, not looking as expected. – Yana Aug 25 '22 at 10:58
  • Yes, this is the one-vs-rest approach. Would you kindly share a sample of `y_proba` and `y_test`? – dx2-66 Aug 25 '22 at 11:04
  • I have added examples to my question. – Yana Aug 25 '22 at 11:22
  • So apparently you've got classes labeled up to 7, but two of those aren't present? In that case `classes = [2, 3, 5, 6, 7]` (or whichever are actually present) with `for i in range(len(classes)): ...` should work regardless. – dx2-66 Aug 25 '22 at 12:21
  • Well, yes, I dropped two of the classes because they had very few examples (fewer than 50 each) to train on. But I have added a screenshot of the result I got from your code, and it really looks strange to me. Also, there is usually a diagonal on the chart over which the results are placed. – Yana Aug 25 '22 at 12:30
  • You can add the line `plt.plot([0, 1], linestyle='--')` before the loop. Does the adjustment from the previous comment help? – dx2-66 Aug 25 '22 at 12:50
  • I have shared the screenshot of the result after the modification above. As I said, it looks strange. I also tried binarizing the labels before splitting the data, but the result is even crazier; see the second chart. – Yana Aug 25 '22 at 13:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247558/discussion-between-dx2-66-and-yana). – dx2-66 Aug 25 '22 at 13:13
  • Probably the classifier's `classes_` attribute would work in place of the sorted list of labels, and would be a little more intrinsic? – Ben Reiniger Aug 25 '22 at 16:52
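For reference, the `classes_` suggestion from the last comment would look roughly like this (a sketch; it assumes clf is the unpickled classifier, and relies on the fact that the columns of predict_proba follow clf.classes_ order):

# Class labels in the exact order used by the predict_proba columns
classes = list(clf.classes_)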