How to nest LabelKFold?

Question

I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.

The code I have looks like this:

import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
    ...
#get data (x, y, labels)
    ...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)  

svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("svr", LinearSVR()),
])

search_params = dict(
    svr__C = C_space,
    svr__epsilon = epsilon_space
)

kfold = LabelKFold(labels, 5)

svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)

train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
    ...
#plot learning curve

My question is how to set up the cv argument for the grid search and the learning curve so that my original set is split into training and test sets that share no labels when computing the learning curve, and each of those training sets is then split again into training and test sets that share no labels for the grid search.

Essentially, how do I run a nested LabelKFold?

I, the user who created the bounty for this question, wrote the following reproducible example using data available from sklearn.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold

digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1

strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in

## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)

## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)

#############################################################################
##  this works: using regular randomly-allocated 5-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                        cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)

##########################################################################
##  this does not work: attempting to use label-aware CV in the inner loop
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                                cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)

score 3 · Accepted Answer · answered Sep 07 '16 at 15:54

As I understand your question, you want a LabelKFold score on your data while grid-searching the parameters of your pipeline within each iteration of that outer LabelKFold, again using a LabelKFold for the inner search. I was not able to achieve that out of the box, but it takes only one loop:

outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)  # array form, so labels can be indexed with fold indices
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    # build the inner splitter from the labels of the outer training rows only,
    # so that its indices are relative to X[outer_train]
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    # grid search on the outer training rows, then score on the held-out outer fold
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv=inner_cv, n_jobs=1)
    clf.fit(X[outer_train], Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))

Running the code, the first iteration yields:

Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11])   Test: set([9, 2, 3, 12, 6])
Inner:
    Train: set([0, 10, 11, 5, 7])   Test: set([8, 1, 4])
    Train: set([1, 4, 5, 8, 10, 11])    Test: set([0, 7])
    Train: set([0, 1, 4, 8, 7])     Test: set([10, 11, 5])
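
The same property can also be checked programmatically rather than by eye. The following is a minimal sketch in Python 3 that uses only the standard library. Note that group_kfold is a hypothetical stand-in for a label-aware splitter: it deals labels round-robin into folds, which is not how scikit-learn's LabelKFold allocates them. The 1797 rows are assumed only to mirror the size of the digits data.

```python
def group_kfold(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no label appears on both sides.

    Hypothetical helper for illustration: distinct labels are dealt
    round-robin into folds (not scikit-learn's allocation strategy).
    """
    labels = sorted(set(groups))
    fold_of = {label: i % n_folds for i, label in enumerate(labels)}
    for fold in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Same labelling scheme as in the question: 13 labels assigned cyclically.
strata = [i % 13 for i in range(1797)]

n_checked = 0
for outer_train, outer_test in group_kfold(strata, 3):
    # Labels restricted to the outer training rows; the inner indices
    # produced below are positions in this list, not in the full data.
    outer_train_labels = [strata[i] for i in outer_train]
    assert not set(outer_train_labels) & {strata[i] for i in outer_test}

    for inner_train, inner_test in group_kfold(outer_train_labels, 3):
        inner_train_labels = {outer_train_labels[i] for i in inner_train}
        inner_test_labels = {outer_train_labels[i] for i in inner_test}
        # No label may be shared between inner train and inner test.
        assert not inner_train_labels & inner_test_labels
        n_checked += 1
```

The point of the sketch is the indexing: the inner splitter only ever sees the labels of the outer training rows, so its indices must be applied to that subset and not to the full data.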

The train and test label sets are disjoint at both levels, so it is easy to verify that the code executes as intended. Your cross-validation scores are collected in the list scores, where you can process them further. I reused the variables you defined in your last piece of code, e.g. strata.
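
For readers on scikit-learn 0.18 or later, where sklearn.cross_validation and sklearn.grid_search were replaced by sklearn.model_selection and LabelKFold was renamed GroupKFold, the same loop might look roughly like the sketch below. It is an adaptation under assumptions, not part of the original answer: it targets Python 3, uses the built-in "roc_auc" scoring string in place of the make_scorer call, and passes the labels through the groups argument.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

digits = load_digits()
X = digits.data
Z = (digits.target > 4).astype(int)   # 2-class target, as in the question
strata = np.arange(Z.size) % 13       # same labels as the question's strata

toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
tuned_par = {"max_features": [5, 10]}

outer_cv = GroupKFold(n_splits=3)
scores = []
for outer_train, outer_test in outer_cv.split(X, Z, groups=strata):
    # Outer folds must not share labels.
    assert not set(strata[outer_train]) & set(strata[outer_test])

    # The inner splitter receives only the labels of the outer training rows.
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par,
                       scoring="roc_auc", cv=GroupKFold(n_splits=3))
    clf.fit(X[outer_train], Z[outer_train], groups=strata[outer_train])

    # Score the refitted best model on the held-out outer fold.
    scores.append(clf.score(X[outer_test], Z[outer_test]))
```

As in the loop above, scores ends up holding one value per outer fold.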

This is pretty much how I had to do it: running the k-fold loop on my own and running the grid search on the individual folds. I'm the original questioner, but I'm not the one who put a bounty on this question. I'm not sure how that works, but I'm going to upvote this answer because it is the best solution that I know of. However, I'm going to wait for the bounty holder's response before accepting the answer. - Alex, Sep 07 '16 at 17:09
This looks very viable -- I'll give this a whirl. It would appear that this is the right way to do it. Thank you very much. FWIW, @Alex, I believe that only I can award the bounty, so geompalik can look forward to that in the next 24 hours. - Sycorax, Sep 07 '16 at 19:30
