I am unclear on how my test set AUC can be so consistently high while my cross-validated AUC on the training set (scoring='roc_auc') is so much lower. The more usual situation is the reverse (high training-set CV score, low test-set score) due to over-fitting.
Why might my AUC on the test data be quite high (and consistent with a research paper I am using as a benchmark), while my CV AUC is much lower?
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# clf is a DecisionTreeClassifier fit once on the training portion of the split
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
auc_dt = metrics.auc(fpr, tpr)
print 'roc auc new', metrics.roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
print 'Test set DT AUC: ', auc_dt
Results:
roc auc new 0.883120510099
Test set DT AUC:  0.883120510099
When I use cross-validation:
from sklearn.cross_validation import StratifiedKFold, cross_val_score
# stratified 10-fold CV on the training set, shuffled because the rows are ordered
shuffle = StratifiedKFold(y_train, n_folds=10, shuffle=True)
scores = cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='roc_auc')
print scores
print 'Average Training set DT CV score: ',scores.mean()
Results:
[ 0.64501863  0.64880271  0.62380773  0.63231963  0.59982753  0.63169843  0.62608849  0.62264435  0.63381149  0.60471224]
I thought the problem might be that I did not know how to use predict_proba with the classifier inside cross_val_score, so I used a different method (similar to the example in the scikit-learn docs):
cv = StratifiedKFold(y_train, n_folds=6, shuffle=True)
classifier = DecisionTreeClassifier()
for i, (train, test) in enumerate(cv):
    # fit on this fold's training indices, score on the held-out indices
    probas_ = classifier.fit(x_train.values[train], y_train.values[train]).predict_proba(x_train.values[test])
    fpr, tpr, thresholds = metrics.roc_curve(y_train.values[test], probas_[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    print 'roc # %s, %s' % (i, roc_auc)
Outcome:
- roc # 0, 0.633910529504
- roc # 1, 0.63380692856
- roc # 2, 0.624857088789
- roc # 3, 0.636719967088
- roc # 4, 0.623175499321
- roc # 5, 0.613694032062
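For what it's worth, newer versions of scikit-learn (0.18+, via sklearn.model_selection) can return the out-of-fold probabilities directly with cross_val_predict, which gives a single pooled AUC rather than a per-fold mean. A minimal sketch, assuming that newer API is available (note it shadows the old StratifiedKFold import, so run it separately):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn import metrics
# out-of-fold predicted probabilities for every training sample
cv_new = StratifiedKFold(n_splits=6, shuffle=True)
probas = cross_val_predict(classifier, x_train, y_train, cv=cv_new, method='predict_proba')
print 'pooled out-of-fold AUC:', metrics.roc_auc_score(y_train, probas[:, 1])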
More info: the dataset is ordered, so I use the shuffle parameter. Without shuffle, I get fold scores ranging from near 0 to very high, which is what you would expect from an ordered dataset; a sketch of that comparison is below.
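For reference, a minimal sketch of the shuffled vs. unshuffled comparison, using the same old-style API and the clf, x_train and y_train already defined above:

from sklearn.cross_validation import StratifiedKFold, cross_val_score
# folds built in the original (ordered) row order vs. shuffled within each class
ordered_cv = StratifiedKFold(y_train, n_folds=10, shuffle=False)
shuffled_cv = StratifiedKFold(y_train, n_folds=10, shuffle=True)
print 'no shuffle:', cross_val_score(clf, x_train, y_train, cv=ordered_cv, scoring='roc_auc')
print 'shuffle:   ', cross_val_score(clf, x_train, y_train, cv=shuffled_cv, scoring='roc_auc')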
I have been digging into the use of AUC and CV all day, but cannot figure this one out.
I see a similar result with KNeighborsClassifier: a higher AUC on the test set using metrics.roc_curve and metrics.auc, but a substantially lower CV AUC from the CV methods above.
In case it helps, the confusion matrix on the test set is as follows:
true negative: 3550, false negative: 116, true positive: 335, false positive: 118
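For context, that matrix could be produced with something like the following sketch (assuming the same fitted clf and test split; metrics.confusion_matrix returns [[tn, fp], [fn, tp]] for a binary problem):

from sklearn import metrics
# rows are the true classes, columns the predicted classes
tn, fp, fn, tp = metrics.confusion_matrix(y_test, clf.predict(x_test)).ravel()
print 'true negative:', tn, 'false negative:', fn, 'true positive:', tp, 'false positive:', fp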
Using accuracy as the scorer gives me much better CV scores.
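A sketch of that comparison, reusing the shuffle folds defined earlier; accuracy and ROC AUC measure different things, and with roughly 11% positives (going by the test-set confusion matrix) a high accuracy can coexist with a mediocre AUC:

from sklearn.cross_validation import cross_val_score
# same folds, two different scorers; a high accuracy on an imbalanced target
# does not imply a high AUC
print 'accuracy:', cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='accuracy').mean()
print 'roc_auc: ', cross_val_score(clf, x_train, y_train, cv=shuffle, scoring='roc_auc').mean()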
Any ideas would help.
EDIT: I ran the CV on the test set as well (the set where the single-split AUC was high) and got approximately the same CV AUC as above (just slightly worse).
I also used a very stripped-down version of the script, where I import the data, split the independent variables from the dependent variable, encode the categorical variables with get_dummies, and run the classifier both on its own and in CV. Same results; it looks roughly like the sketch below.
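In outline (file and column names here are placeholders, not the real ones):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split
from sklearn import metrics

df = pd.read_csv('data.csv')                   # placeholder file name
y = df['target']                               # placeholder target column
x = pd.get_dummies(df.drop('target', axis=1))  # one-hot encode the categoricals

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

# AUC on the single held-out test split
print 'test AUC:', metrics.roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])

# cross-validated AUC on the training set
cv = StratifiedKFold(y_train, n_folds=10, shuffle=True)
print 'CV AUC:  ', cross_val_score(clf, x_train, y_train, cv=cv, scoring='roc_auc').mean()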
Working hypothesis: I believe the issue has to do with the ordered, stratified nature of the data and how cross-validation interacts with it (I just found out that GridSearchCV gives nonsensical results as well). As I do more research into this, I will add my findings here.
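One quick check on that hypothesis (a sketch; it only looks at whether the class balance drifts along the row order, not at drift in the features, and it assumes a 0/1 target):

import numpy as np
# if the rows are ordered, the positive rate should vary noticeably from one
# contiguous chunk of the unshuffled target to the next (y is the full target column)
for i, chunk in enumerate(np.array_split(y.values, 10)):
    print 'chunk %d positive rate: %.3f' % (i, chunk.mean())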