
I am dealing with a binary classification problem.

I have two lists of indices, listTrain and listTest, which form a partition of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error, in a cross-validation process (hold-out set approach).

However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.

The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv,
                           n_jobs=-1, verbose=0, scoring=errorType)

So, my question is how to create custom_cv based on these indexes to be used in this method?

X and y are, respectively, the feature matrix and the vector of labels.

Example: Suppose that I only have one hyperparameter alpha, which belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance, the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3, and choose the alpha that minimizes the error. A minimal sketch of that manual procedure is shown below.
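
In other words, what I want GridSearchCV to reproduce is something like the following manual loop (RidgeClassifier and the 0/1 error are only placeholders for my actual model and error measure):

import numpy as np
from sklearn.linear_model import RidgeClassifier   # placeholder model with an alpha hyperparameter
from sklearn.metrics import accuracy_score         # placeholder for my actual error measure

best_alpha, best_error = None, np.inf
for alpha in [1, 2, 3]:
    model = RidgeClassifier(alpha=alpha)
    # estimate the model parameters on the samples indexed by listTrain
    model.fit(X[listTrain], y[listTrain])
    # evaluate the error on the samples indexed by listTest
    error = 1 - accuracy_score(y[listTest], model.predict(X[listTest]))
    if error < best_error:
        best_alpha, best_error = alpha, error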

DanielTheRocketMan
  • Are you attempting to 1) do cross-validation using multiple folds of your training data and then 2) test generalisation using the test set? Because as I understand your question, you want to fit a classifier with specific parameters on `listTrain` and test its performance using `listTest`. But that's just testing different parameter sets on the same problem, and not really cross-validation? – JimmyOnThePage Jun 07 '19 at 04:44
  • @JimmyOnThePage Suppose that I only have one hyperparameter alpha that belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance, the coefficients of a regression) using the samples associated with listTrain and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error. – DanielTheRocketMan Jun 07 '19 at 07:56
  • Please keep in mind that this procedure is wrong (from an ML practice point of view) and strongly discouraged; your test set is supposed to be used only once, for the performance evaluation of your **final** model, otherwise you are effectively using it as one more *validation* set (different from the test one). See answers in [Order between using validation, training and test sets](https://stackoverflow.com/questions/54126811/order-between-using-validation-training-and-test-sets). – desertnaut Jun 07 '19 at 09:37
  • @desertnaut Sorry! I am not using the test set. I divided the training set into two parts using the index sets listTrain and listTest. The actual test set will only be used later. I am doing this for two reasons: 1) my sample is highly unbalanced, and 2) some of my regressors are overfitting and I want to test a different methodology. – DanielTheRocketMan Jun 07 '19 at 10:28

1 Answer


EDIT: Actual answer to the question. Try passing the cv argument a generator that yields the (train, test) indices:

from sklearn.model_selection import GridSearchCV

def index_gen(listTrain, listTest):
    # yields a single (train, test) split built from the two index lists
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest),
                           n_jobs=-1, verbose=0, scoring=errorType)
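
An alternative sketch (assuming listTrain and listTest together cover every row of X, and that the search is fitted on the full X and y) is to pass a one-element list of index arrays, or scikit-learn's PredefinedSplit:

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Option 1: a plain list containing a single (train, test) pair of index arrays
custom_cv = [(np.asarray(listTrain), np.asarray(listTest))]

# Option 2: PredefinedSplit -- samples marked -1 are only ever used for training,
# samples marked 0 form the single test fold
test_fold = np.full(len(y), -1)
test_fold[listTest] = 0
custom_cv = PredefinedSplit(test_fold)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(X, y)  # fit on the full arrays so the indices line up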

EDIT: Original answer (kept for reference):

As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)

grid_search.fit(x[listTrain], y[listTrain])

Now, your training set will be split into 5 folds (you can choose the number here); the model is trained on 4 of those folds with a specific set of hyperparameters and tested on the fold that was left out. This is repeated 5 times, until each of your training examples has been part of a left-out fold. This whole procedure is done for each hyperparameter setting you are testing (5 folds x 3 alpha values = 15 fits in your example).

grid_search.best_params_ will give you a dictionary of the parameters that performed best over all 5 folds. These are the parameters you use to train your final classifier, again using only the training set:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], y[listTrain])

Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:

predictions = clf.predict(x[listTest])
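
To turn those predictions into an actual performance number, an evaluation along these lines can be used (accuracy and the classification report are just illustrative metrics; pick whatever matches errorType):

from sklearn.metrics import accuracy_score, classification_report

# compare the predictions against the held-out labels
print("Accuracy:", accuracy_score(y[listTest], predictions))
print(classification_report(y[listTest], predictions))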
JimmyOnThePage
  • Thanks for your answer. This is exactly what I am doing now. Please see the comment above: I am not using the actual test set; those index sets are partitions of the actual training set. Please also see the edit to the question. – DanielTheRocketMan Jun 07 '19 at 10:50
  • Ah, OK, I see now. Will EDIT the answer, but will keep it up for future reference – JimmyOnThePage Jun 07 '19 at 11:10