
I'm using the following code to run cross-validation, returning ROC AUC scores:

rf = RandomForestClassifier(n_estimators=1000, oob_score=True, class_weight='balanced')
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring='roc_auc')

How can I return the ROC AUC based on

roc_auc_score(y_test, results.predict(X_test))

rather than

roc_auc_score(y_test, results.predict_proba(X_test)[:, 1])
  • ROC AUC is only useful if you can rank order your predictions. Using `.predict()` will just give the most probable class for each sample, and so you won't be able to do that rank ordering. – Randy Dec 07 '16 at 00:11

1 Answer

ROC AUC is only useful if you can rank order your predictions. Using .predict() will just give the most probable class for each sample, and so you won't be able to do that rank ordering.

In the example below, I fit a random forest on a randomly generated dataset and tested it on a held-out sample. The blue line shows the proper ROC curve computed using .predict_proba(), while the green one shows the degenerate curve from .predict(), which only really knows about the one cutoff point.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rf = RandomForestClassifier()

# Noisy two-feature dataset, with 10% held out for testing
data, target = make_classification(n_samples=4000, n_features=2, n_redundant=0, flip_y=0.4)
train, test, train_t, test_t = train_test_split(data, target, train_size=0.9)

rf.fit(train, train_t)

# roc_curve returns (fpr, tpr, thresholds); plot fpr against tpr for each case
plt.plot(*roc_curve(test_t, rf.predict_proba(test)[:, 1])[:2])
plt.plot(*roc_curve(test_t, rf.predict(test))[:2])
plt.show()

[Figure: ROC curves on the held-out sample; the .predict_proba() curve in blue and the single-cutoff .predict() curve in green.]

EDIT: While there's nothing stopping you from calculating an roc_auc_score() on .predict(), the point of the above is that it's not really a useful measurement.

In [5]: roc_auc_score(test_t, rf.predict_proba(test)[:,1]), roc_auc_score(test_t, rf.predict(test))
Out[5]: (0.75502749115010925, 0.70238005573548234) 
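If you do want cross_val_score itself to score on the hard labels from .predict(), here is a minimal sketch (with the caveat above that the resulting number isn't very meaningful): wrapping roc_auc_score in make_scorer with its defaults should make the scorer call .predict() rather than .predict_proba(). The name hard_label_auc is just illustrative, and rf, data, target are reused from the example above.

from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

# make_scorer defaults to using .predict(), so roc_auc_score sees hard 0/1 labels
hard_label_auc = make_scorer(roc_auc_score)
scores = cross_val_score(rf, data, target, cv=10, scoring=hard_label_auc)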
  • Thanks. But I'm concerned with the ROC score rather than the ROC curve, so I want to get roc_auc_score(y_test, results.predict(X_test)) – LUSAQX Dec 07 '16 at 20:08
  • @LUSAQX There is no such thing as an "ROC score"; do you mean AUC (area under the curve)? – Calimo Dec 07 '16 at 21:38
  • I mean roc_auc_score(). – LUSAQX Dec 07 '16 at 21:43
  • `roc_auc_score()` is just the area underneath an ROC curve. You can easily calculate the area under that green curve with `roc_auc_score()`, but the point of my answer is that it's going to be essentially a meaningless number since all you really have is a single sensitivity/specificity measurement by using `.predict()` – Randy Dec 07 '16 at 22:27
  • What does the `*` do in `roc_curve()`? – Chris Sep 10 '20 at 18:54
  • @Chris `roc_curve` returns 3 elements, and I want to pass the first two to `plt.plot`. The `[:2]` picks off the first two elements, and the `*` is the argument unpacking operator to pass those as two separate inputs to `plt.plot` (see e.g., https://www.geeksforgeeks.org/packing-and-unpacking-arguments-in-python/ for more detail on how that works). – Randy Sep 11 '20 at 15:13
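For reference, the same plotting call written out without the unpacking shorthand (the variable names here are just illustrative):

fpr, tpr, thresholds = roc_curve(test_t, rf.predict_proba(test)[:, 1])
plt.plot(fpr, tpr)  # equivalent to plt.plot(*roc_curve(...)[:2])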