
I want to use roc_auc_score to evaluate the performance of my classifier, but I'm not sure what the right parameters to give it are.

Here is the description of the function in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

As you can see, it needs y_score, the probability estimates of the positive class. But how do I determine which class is positive? For example, when I use predict_proba, which column should I use?

Now the way I use this function is as follows:

    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score
    import numpy as np

    clf = SVC(kernel='linear', probability=True, random_state=1)
    clf.fit(train, train_Labels)

    score = np.array(clf.predict_proba(test_values))
    auc = roc_auc_score(test_Labels, score[:, 1])

train_Labels and test_Labels are one-dimensional vectors with the 0s first and the 1s after, e.g. [0, 0, 0, 1, 1, 1].

In train and test, one row represents a sample, and one column represents a feature.

Using predict_proba might not be appropriate, but there are special requirements in my project, so don't worry about that.

I want to know whether the vectors I pass into roc_auc_score (y_true and y_score) are correct, i.e. whether score[:, 1] really is the probability of the positive class.

If anything about the question is unclear, please ask me; I am a novice, please forgive me.


2 Answers


I have one shortcut and one answer:

The shortcut:

You can use sklearn's make_scorer function to create a more "robust" roc_auc_score. Here is a complete example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy

X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
model.predict_proba(X)

results in:

array([[0.2022794 , 0.7977206 ],
       [0.78449699, 0.21550301],
       [0.87371492, 0.12628508],
       ...,
       [0.19976995, 0.80023005],
       [0.00463778, 0.99536222],
       [0.93405707, 0.06594293]])

You can create your own roc_auc_score by doing:

from sklearn.metrics import make_scorer
roc_auc_scorer = make_scorer(
    roc_auc_score, 
    needs_proba=True,
)

Therefore

roc_auc_scorer(model, X, y)

returns

0.9246156984627939
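
As a side note, the scorer built with make_scorer plugs directly into sklearn's model-selection utilities via their scoring parameter. A minimal sketch, reusing the model, X, y, and roc_auc_scorer defined above:

    from sklearn.model_selection import cross_val_score

    # The custom scorer is accepted anywhere sklearn takes a `scoring`
    # argument, e.g. cross-validation:
    cv_auc = cross_val_score(model, X, y, scoring=roc_auc_scorer, cv=5)
    print(cv_auc.mean())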

Now the more formal answer:

From the sklearn documentation:

y_score

[ ... ] scores must be the scores of the class with the greater label

(source https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)

and predict_proba returns:

T : array-like of shape (n_samples, n_classes)

Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

(source: LogisticRegression's docs; this might not be true for every classifier! https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba)

Given that, from my understanding, you can determine "the greater label" and therefore which column you should use. From the previous example,

model.classes_

has been set to:

array([0, 1])

So using numpy.argmax you can get the position of the class with the greater label and get the roc_auc_score:

_idx = numpy.argmax(model.classes_)
_p = model.predict_proba(X)[:,_idx]
roc_auc_score(y, _p)

results in:

0.9246156984627939
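
Applied to your SVC setup, the same logic looks like this. This is a minimal sketch: your actual train/test arrays aren't shown in the question, so synthetic stand-ins are generated here.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    # Hypothetical stand-ins for the question's train/test arrays.
    X, y = make_classification(n_samples=200, n_classes=2, random_state=1)
    train, test_values, train_Labels, test_Labels = train_test_split(
        X, y, random_state=1)

    clf = SVC(kernel='linear', probability=True, random_state=1)
    clf.fit(train, train_Labels)

    # predict_proba columns follow clf.classes_, so the class with the
    # greater label sits at np.argmax(clf.classes_) -- column 1 here.
    pos_idx = np.argmax(clf.classes_)
    auc = roc_auc_score(test_Labels, clf.predict_proba(test_values)[:, pos_idx])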

I hope this may help!


If I understand correctly, you want to get the roc_auc_score for your binary classification problem.

There is no need for you to rearrange anything; as long as the score vector is one-dimensional, you can simply use:

auc = roc_auc_score(test_Labels, score)

This is based on the documentation:

y_score : array-like of shape (n_samples,) or (n_samples, n_classes)

Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). In the multiclass case, these must be probability estimates which sum to 1. The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.

This fits your problem.

test_Labels.shape and score.shape should match.
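
For illustration, here is a minimal sketch with synthetic data (your actual arrays aren't shown in the question). In the binary case, decision_function already returns a one-dimensional vector, so the shapes line up:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=200, n_classes=2, random_state=1)
    train, test_values, train_Labels, test_Labels = train_test_split(
        X, y, random_state=1)

    clf = SVC(kernel='linear', random_state=1).fit(train, train_Labels)

    # For a binary problem decision_function returns shape (n_samples,),
    # matching test_Labels, so it can go straight into roc_auc_score.
    score = clf.decision_function(test_values)
    print(score.shape, test_Labels.shape)  # both (50,)
    auc = roc_auc_score(test_Labels, score)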

  • Thank you for your reply. In addition, I haven't rearranged the labels; the training set and test set class labels were originally 0 followed by 1, so there's no problem with my usage, is there? I think this is the right way to use it: `clf = SVC(kernel='linear', probability=True, random_state=1); clf.fit(train, train_Labels); score = np.array(clf.decision_function(test_values)); auc = roc_auc_score(test_Labels, score)` Do you agree with me? – jiayeah Jan 09 '20 at 10:21
  • Try it, it should be fine. – PV8 Jan 09 '20 at 12:32