I am unclear on how my test set AUC can be so consistently high while my cross-validated AUC on the training set (scoring='roc_auc') is so much lower. The more usual situation is the reverse (high training-set CV score, low test-set score) due to over-fitting.
Why might my AUC on the test data be quite high (and consistent with a research paper I am using as a benchmark), while my CV AUC is much lower?
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# clf is a DecisionTreeClassifier fit once on the training portion of the split
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
auc_dt = metrics.auc(fpr, tpr)
print 'roc auc new', metrics.roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
print 'Test set DT AUC: ', auc_dt
Results:
roc auc new 0.883120510099
Test set DT AUC:  0.883120510099
When I use cross-validation:
from sklearn.cross_validation import StratifiedKFold, cross_val_score
# stratified 10-fold CV on the training set, shuffled because the rows are ordered
shuffle = StratifiedKFold(y_train, n_folds=10, shuffle=True)
scores = cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='roc_auc')
print scores
print 'Average Training set DT CV score: ',scores.mean()
Results:
[ 0.64501863  0.64880271  0.62380773  0.63231963  0.59982753  0.63169843  0.62608849  0.62264435  0.63381149  0.60471224]
I thought the problem might be that I did not know how to use predict_proba with the classifier inside cross_val_score, so I used a different method (similar to the example in the scikit-learn docs):
cv = StratifiedKFold(y_train, n_folds=6, shuffle=True)
classifier = DecisionTreeClassifier()
for i, (train, test) in enumerate(cv):
    # fit on this fold's training indices, score on the held-out indices
    probas_ = classifier.fit(x_train.values[train], y_train.values[train]).predict_proba(x_train.values[test])
    fpr, tpr, thresholds = metrics.roc_curve(y_train.values[test], probas_[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    print 'roc # %s, %s' % (i, roc_auc)
Outcome:
- roc # 0, 0.633910529504
- roc # 1, 0.63380692856
- roc # 2, 0.624857088789
- roc # 3, 0.636719967088
- roc # 4, 0.623175499321
- roc # 5, 0.613694032062
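For what it's worth, newer versions of scikit-learn (0.18+, via sklearn.model_selection) can return the out-of-fold probabilities directly with cross_val_predict, which gives a single pooled AUC rather than a per-fold mean. A minimal sketch, assuming that newer API is available (note it shadows the old StratifiedKFold import, so run it separately):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn import metrics
# out-of-fold predicted probabilities for every training sample
cv_new = StratifiedKFold(n_splits=6, shuffle=True)
probas = cross_val_predict(classifier, x_train, y_train, cv=cv_new, method='predict_proba')
print 'pooled out-of-fold AUC:', metrics.roc_auc_score(y_train, probas[:, 1])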
More info: the dataset is ordered, so I use the shuffle parameter. Without shuffle, I get fold scores ranging from near 0 to very high, which is what you would expect from an ordered dataset; a sketch of that comparison is below.
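For reference, a minimal sketch of the shuffled vs. unshuffled comparison, using the same old-style API and the clf, x_train and y_train already defined above:

from sklearn.cross_validation import StratifiedKFold, cross_val_score
# folds built in the original (ordered) row order vs. shuffled within each class
ordered_cv = StratifiedKFold(y_train, n_folds=10, shuffle=False)
shuffled_cv = StratifiedKFold(y_train, n_folds=10, shuffle=True)
print 'no shuffle:', cross_val_score(clf, x_train, y_train, cv=ordered_cv, scoring='roc_auc')
print 'shuffle:   ', cross_val_score(clf, x_train, y_train, cv=shuffled_cv, scoring='roc_auc')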
I have been digging into the use of AUC and CV all day, but cannot figure this one out.
I see a similar result with KNeighborsClassifier: a higher AUC on the test set using metrics.roc_curve and metrics.auc, but a substantially lower CV AUC from the CV methods above.
In case it helps, the confusion matrix on the test set is as follows:
true negative: 3550, false negative: 116, true positive: 335, false positive: 118
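For context, that matrix could be produced with something like the following sketch (assuming the same fitted clf and test split; metrics.confusion_matrix returns [[tn, fp], [fn, tp]] for a binary problem):

from sklearn import metrics
# rows are the true classes, columns the predicted classes
tn, fp, fn, tp = metrics.confusion_matrix(y_test, clf.predict(x_test)).ravel()
print 'true negative:', tn, 'false negative:', fn, 'true positive:', tp, 'false positive:', fp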
Using accuracy as the scorer gives me much better CV scores.
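A sketch of that comparison, reusing the shuffle folds defined earlier; accuracy and ROC AUC measure different things, and with roughly 11% positives (going by the test-set confusion matrix) a high accuracy can coexist with a mediocre AUC:

from sklearn.cross_validation import cross_val_score
# same folds, two different scorers; a high accuracy on an imbalanced target
# does not imply a high AUC
print 'accuracy:', cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='accuracy').mean()
print 'roc_auc: ', cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='roc_auc').mean()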
Any ideas would help.
EDIT: I ran the CV on the test set as well (the set where the single-split AUC was high) and got approximately the same CV AUC as above (just slightly worse).
I also used a very stripped-down version of the script, where I import the data, split the independent variables from the dependent variable, encode the categorical variables with get_dummies, and run the classifier both on its own and in CV. Same results; it looks roughly like the sketch below.
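In outline (file and column names here are placeholders, not the real ones):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split
from sklearn import metrics

df = pd.read_csv('data.csv')                   # placeholder file name
y = df['target']                               # placeholder target column
x = pd.get_dummies(df.drop('target', axis=1))  # one-hot encode the categoricals

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

# AUC on the single held-out test split
print 'test AUC:', metrics.roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])

# cross-validated AUC on the training set
cv = StratifiedKFold(y_train, n_folds=10, shuffle=True)
print 'CV AUC:  ', cross_val_score(clf, x_train, y_train, cv=cv, scoring='roc_auc').mean()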
Working hypothesis: I believe the issue has to do with the ordered, stratified nature of the data and how cross-validation interacts with it (I just found out that GridSearchCV gives nonsensical results as well). As I do more research into this, I will add my findings here.
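One quick check on that hypothesis (a sketch; it only looks at whether the class balance drifts along the row order, not at drift in the features, and it assumes a 0/1 target):

import numpy as np
# if the rows are ordered, the positive rate should vary noticeably from one
# contiguous chunk of the unshuffled target to the next (y is the full target column)
for i, chunk in enumerate(np.array_split(y.values, 10)):
    print 'chunk %d positive rate: %.3f' % (i, chunk.mean())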