
I want to use roc_auc_score to evaluate the performance of my classifier, but I'm not sure what the right parameters to give it are.

Here is the description of the function in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

As you can see, it needs y_score, the probability estimates of the positive class. But how do I determine which class is positive? For example, when I use predict_proba, which column should I use?

Now the way I use this function is as follows:

    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score
    import numpy as np

    clf = SVC(kernel='linear', probability=True, random_state=1)
    clf.fit(train, train_Labels)

    score = np.array(clf.predict_proba(test_values))
    auc = roc_auc_score(test_Labels, score[:, 1])

train_Labels and test_Labels are one-dimensional vectors with the 0s first and the 1s after, e.g. [0, 0, 0, 1, 1, 1].

In train and test, one row represents a sample, and one column represents a feature.

Using predict_proba might not be appropriate, but there are special requirements in my project, so don't worry about that.

I want to know whether the vectors I pass into roc_auc_score (y_true and y_score) are correct, i.e. whether score[:, 1] really is the probability of the positive class.

If anything about the question is unclear, please ask me; I am a novice, please forgive me.


2 Answers


I have one shortcut and one answer:

The shortcut:

You can use sklearn's make_scorer function to create a more "robust" roc_auc_score. Here is a complete example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy

X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
model.predict_proba(X)

results in:

array([[0.2022794 , 0.7977206 ],
       [0.78449699, 0.21550301],
       [0.87371492, 0.12628508],
       ...,
       [0.19976995, 0.80023005],
       [0.00463778, 0.99536222],
       [0.93405707, 0.06594293]])

You can create your own roc_auc_score by doing:

from sklearn.metrics import make_scorer
roc_auc_scorer = make_scorer(
    roc_auc_score, 
    needs_proba=True,
)

Therefore

roc_auc_scorer(model, X, y)

returns

0.9246156984627939
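
As a side note, the scorer built with make_scorer plugs directly into sklearn's model-selection utilities via their scoring parameter. A minimal sketch, reusing the model, X, y, and roc_auc_scorer defined above:

    from sklearn.model_selection import cross_val_score

    # The custom scorer is accepted anywhere sklearn takes a `scoring`
    # argument, e.g. cross-validation:
    cv_auc = cross_val_score(model, X, y, scoring=roc_auc_scorer, cv=5)
    print(cv_auc.mean())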

Now the more formal answer:

From the sklearn documentation:

y_score

[ ... ] scores must be the scores of the class with the greater label

(source https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)

and predict_proba returns:

T : array-like of shape (n_samples, n_classes)

Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

(source: LogisticRegression's docs; this might not be true for every classifier! https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba)

Given that, from my understanding, you can determine "the greater label" and therefore which column you should use. From the previous example,

model.classes_

has been set to:

array([0, 1])

So using numpy.argmax you can get the position of the class with the greater label and get the roc_auc_score:

_idx = numpy.argmax(model.classes_)
_p = model.predict_proba(X)[:,_idx]
roc_auc_score(y, _p)

results in:

0.9246156984627939
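
Applied to your SVC setup, the same logic looks like this. This is a minimal sketch: your actual train/test arrays aren't shown in the question, so synthetic stand-ins are generated here.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    # Hypothetical stand-ins for the question's train/test arrays.
    X, y = make_classification(n_samples=200, n_classes=2, random_state=1)
    train, test_values, train_Labels, test_Labels = train_test_split(
        X, y, random_state=1)

    clf = SVC(kernel='linear', probability=True, random_state=1)
    clf.fit(train, train_Labels)

    # predict_proba columns follow clf.classes_, so the class with the
    # greater label sits at np.argmax(clf.classes_) -- column 1 here.
    pos_idx = np.argmax(clf.classes_)
    auc = roc_auc_score(test_Labels, clf.predict_proba(test_values)[:, pos_idx])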

I hope this may help!


If I understand correctly, you want to get the roc_auc_score for your binary classification problem.

There is no need for you to rearrange anything; as long as the score vector is one-dimensional, you can simply use:

auc = roc_auc_score(test_Labels, score)

This is based on the documentation:

y_score : array-like of shape (n_samples,) or (n_samples, n_classes)

Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). In the multiclass case, these must be probability estimates which sum to 1. The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.

This fits your problem.

test_Labels.shape and score.shape should match.
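
For illustration, here is a minimal sketch with synthetic data (your actual arrays aren't shown in the question). In the binary case, decision_function already returns a one-dimensional vector, so the shapes line up:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=200, n_classes=2, random_state=1)
    train, test_values, train_Labels, test_Labels = train_test_split(
        X, y, random_state=1)

    clf = SVC(kernel='linear', random_state=1).fit(train, train_Labels)

    # For a binary problem decision_function returns shape (n_samples,),
    # matching test_Labels, so it can go straight into roc_auc_score.
    score = clf.decision_function(test_values)
    print(score.shape, test_Labels.shape)  # both (50,)
    auc = roc_auc_score(test_Labels, score)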

  • Thank you for your reply. In addition, I haven't rearranged the labels; the training set and test set class labels were originally 0 followed by 1, so there's no problem with my usage, is there? I think this is the right way to use it: `clf = SVC(kernel='linear', probability=True, random_state=1); clf.fit(train, train_Labels); score = np.array(clf.decision_function(test_values)); auc = roc_auc_score(test_Labels, score)` Do you agree with me? – jiayeah Jan 09 '20 at 10:21
  • Try it, it should be fine. – PV8 Jan 09 '20 at 12:32