What is the difference between cross_val_score with scoring='roc_auc' and roc_auc_score?

Question

I am confused about the difference between the cross_val_score scoring metric 'roc_auc' and the roc_auc_score that I can just import and call directly.

The documentation (http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) indicates that specifying scoring='roc_auc' will use sklearn.metrics.roc_auc_score. However, when I run GridSearchCV or cross_val_score with scoring='roc_auc', I get very different numbers than when I call roc_auc_score directly.

Here is my code to help demonstrate what I see:

# score the model using cross_val_score

rf = RandomForestClassifier(n_estimators=150,
                            min_samples_leaf=4,
                            min_samples_split=3,
                            n_jobs=-1)

scores = cross_val_score(rf, X, y, cv=3, scoring='roc_auc')

print(scores)
array([ 0.9649023 ,  0.96242235,  0.9503313 ])

# do a train_test_split, fit the model, and score with roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
rf.fit(X_train, y_train)

print(roc_auc_score(y_test, rf.predict(X_test)))
0.84634039111363313 # quite a bit different than the scores above!

I feel like I am missing something very simple here -- most likely a mistake in how I am implementing/interpreting one of the scoring metrics.

Can anyone shed any light on the reason for the discrepancy between the two scoring metrics?

I am also totally confused by this difference. I also tried using the standard make_scorer() function, which turns a score function into a proper scorer object for cross_val_score, but the results are the same: make_scorer() gives the same result as my manual implementation, while 'roc_auc' gives higher scores. Fortunately the difference was only several percent in my example, unlike yours, but still: which function should I trust? — Anton Fetisov, Mar 15 '16 at 14:08

score 13 · Accepted Answer · edited May 23 '17 at 12:34

This is because you passed the predicted class labels to roc_auc_score instead of the predicted probabilities. The function expects a continuous score for each sample, not the hard class label. Try this instead:

print(roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))

It should give a result similar to the cross_val_score results above. Refer to this post for more info.
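
To see why hard labels pull the AUC down, here is a minimal sketch on a few made-up numbers (the arrays are hypothetical, not the asker's data, and the 0.5 cut-off is an assumption standing in for what predict() typically does for a binary classifier). It is written for Python 3 and a recent scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_proba = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# Assumed stand-in for predict(): cut the probabilities at 0.5.
y_label = (y_proba > 0.5).astype(int)

# With probabilities, every positive/negative pair can be ranked.
auc_from_proba = roc_auc_score(y_true, y_proba)
# With hard labels, only one threshold survives and many pairs become ties.
auc_from_label = roc_auc_score(y_true, y_label)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_label)
```

On these numbers the probability-based AUC is 8/9 while the label-based AUC is 2/3, the same kind of gap as in the question.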

answered Apr 29 '16 at 15:42

George Liu

You are completely right! I'd laugh if I could stop crying. Thanks! – MichaelHood May 04 '16 at 22:59

score 6 · Answer 2 · edited May 23 '17 at 12:34

I just ran into a similar issue here. The key takeaway there was that cross_val_score uses the KFold strategy with default parameters for making the train-test splits, which means splits into consecutive chunks rather than shuffling. train_test_split on the other hand does a shuffled split.

The solution is to make the split strategy explicit and specify shuffling, like this:

shuffle = cross_validation.KFold(len(X), n_folds=3, shuffle=True)
scores = cross_val_score(rf, X, y, cv=shuffle, scoring='roc_auc')
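
For readers on a more recent scikit-learn, where the sklearn.cross_validation module no longer exists, a rough equivalent uses sklearn.model_selection.KFold, which takes n_splits instead of the sample count and n_folds. This is only a sketch: the synthetic data and the random_state values are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and y (assumption, not the asker's data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Shuffle the rows before splitting; random_state makes the folds reproducible.
shuffled_folds = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(rf, X, y, cv=shuffled_folds, scoring="roc_auc")

print(scores)
```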

answered Nov 11 '15 at 01:04

Aniket Schneider

Aniket, thanks for the answer. But specifying the folds and passing them into cross_val_score did not address the discrepancy between the scoring metrics. – MichaelHood Nov 13 '15 at 03:16
I know I'm very late to this, but when you use the `cross_val_score` method with the `roc_auc` scoring, why do you pass it the predicted class labels instead of predicted probabilities? Since it's AUC, doesn't it need probabilities instead, so it can test different threshold values? – NeonBlueHair May 17 '17 at 23:11

score 1 · Answer 3 · answered Oct 25 '16 at 16:05

Ran into this problem myself and after digging a bit found the answer. Sharing for the love.

There are actually two and a half problems.

you need to use the same KFold to compare scores (the same train/test split);
you need to feed probabilities into roc_auc_score (using the predict_proba() method). BUT, some estimators (like SVC) do not have a predict_proba() method; in that case, use the decision_function() method instead.

Here's a full example:

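# Imports needed by the whole example
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score

# Let's use the Digit dataset
digits = load_digits(n_class=4)
X,y = digits.data, digits.target
y[y==2] = 0 # Increase problem difficulty
y[y==3] = 1 # even more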

Using two estimators

LR = LogisticRegression()
SVM = LinearSVC()

Split the train/test set, but keep the splitter in a variable we can reuse.

# random_state only matters with shuffle=True; newer scikit-learn rejects it otherwise
fourfold = StratifiedKFold(n_splits=4)

Feed it to GridSearchCV and save scores. Note we are passing fourfold.

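gs = GridSearchCV(LR, param_grid={}, cv=fourfold, scoring='roc_auc', return_train_score=True)
gs.fit(X,y)
# gskeys is not defined in the original; assumed here to be the per-fold test-score keys
gskeys = ['split%d_test_score' % i for i in range(4)]
gs_scores = np.array([gs.cv_results_[k][0] for k in gskeys])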

Feed it to cross_val_score and save scores.

cv_scores = cross_val_score(LR, X, y, cv=fourfold, scoring='roc_auc')

Sometimes, you want to loop and compute several different scores, so this is what you use.

loop_scores = list()
for idx_train, idx_test in fourfold.split(X, y):
  X_train, y_train, X_test, y_test = X[idx_train], y[idx_train], X[idx_test], y[idx_test]
  LR.fit(X_train, y_train)
  y_prob = LR.predict_proba(X_test)
  auc = roc_auc_score(y_test, y_prob[:,1])
  loop_scores.append(auc)

Do we have the same scores across the board?

print([((a==b) and (b==c)) for a,b,c in zip(gs_scores,cv_scores,loop_scores)])
>>> [True, True, True, True]

BUT, sometimes our estimator does not have a predict_proba() method. So, according to this example, we do this:

for idx_train, idx_test in fourfold.split(X, y):
  X_train, y_train, X_test, y_test = X[idx_train], y[idx_train], X[idx_test], y[idx_test]
  SVM.fit(X_train, y_train)
  y_prob = SVM.decision_function(X_test)
  prob_pos = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min())
  auc = roc_auc_score(y_test, prob_pos)
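
A side note on the min-max rescaling above: AUC depends only on how the scores rank the samples, so the raw decision_function() output can be passed to roc_auc_score as is. The sketch below checks that on synthetic data; the dataset, the split, and max_iter=10000 are assumptions chosen for illustration, and it targets Python 3 with a recent scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem (assumption, not the digits data used above).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

svm = LinearSVC(max_iter=10000).fit(X_train, y_train)

# Signed distance to the separating hyperplane; not a probability.
raw = svm.decision_function(X_test)
# The same scores squeezed into [0, 1], as in the loop above.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

auc_raw = roc_auc_score(y_test, raw)
auc_scaled = roc_auc_score(y_test, scaled)

print(auc_raw, auc_scaled)
```

Both calls should print the same value, since the rescaling keeps the ordering of the scores.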
