
I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict, as I believe this is the function I am going to need.

What I am having trouble with is the splitting of the data set. As far as I am aware, in the usual case the train_test_split() method is used to split the data set into two parts: the train set and the test set. From my understanding, for K-fold cross-validation you need to further split the train set into K parts.

My question is: do I need to split the data set at the beginning into train and test, or not?
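
To make the question concrete, here is a minimal sketch of what I am trying to do (the iris data and the variable names are just stand-ins for my own dataset, not code I already have):

# Minimal sketch (iris used as a stand-in dataset): training-set accuracy
# versus 10-fold cross-validated accuracy for a DecisionTreeClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)

# Accuracy on the training set (fit and score on the same data)
clf.fit(X, y)
print('Training accuracy:', clf.score(X, y))

# Accuracy estimated with 10-fold cross-validation
cv_scores = cross_val_score(clf, X, y, scoring='accuracy', cv=10)
print('10-fold CV accuracy:', cv_scores.mean())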

InNeedOfaName

2 Answers


It depends. My personal opinion is yes: you should split your dataset into a training set and a test set, and then do cross-validation on your training set with K folds. Why? Because it is important to test your model on unseen examples after training and fine-tuning it.

But some people just do cross-validation. Here is the workflow I often use:

from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# `X`, `Y`, `model`, `metric` and `my_param` are placeholders for your own
# data, estimator, scoring metric and candidate hyper-parameter settings.

# Data partition: hold out a test set, keep the rest for training and cross-validation
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on multiple models to see which one gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the scores you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then I tune the hyper-parameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param  # the same estimator rebuilt with `param`
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test your tuned model on the held-out test set
Y_pred = model.predict(X_test)
plot_or_print_metric(Y_pred, Y_test)
Yohann L.
  • So, as far as I understand, ```cross_val_score``` performs both ```model.fit``` (on the K-1 folds) and ```model.predict``` (on the last fold) for each iteration (from 1 to K)? – InNeedOfaName Nov 12 '19 at 16:18
  • Nope, the workflow I wrote is more to find the model I will use for my problem. I use `cross_val_score` to train on `K-1` folds and test on the `K-th` fold. Thus, I have K scores, which give a good representation of how the model performs on your dataset. Then I use another `cross_val_score` to find the best parameters for the model I want to use. When that's done, I do a last training with the whole training set, and then I can test it with my test set. Is it clearer? – Yohann L. Nov 12 '19 at 16:21
  • Yes, what I meant by my comment is that the training on K-1 folds you mentioned in the beginning is similar to the fit on the simple train set, while the test on the K-th fold is similar to the predict on the simple test set. I may have understood it entirely wrong though! – InNeedOfaName Nov 12 '19 at 16:32
  • Hi, can you please answer me this: in your code you mention an object called 'my_param'. What is this? Do you somehow get it from cross_val_score? – thenac Dec 17 '20 at 09:01
  • @thenac it will be the hyper-parameters you want to test with your model. Search on the internet how to fine-tune hyper-parameters; there are plenty of tutorials :) – Yohann L. Dec 17 '20 at 09:25
  • Thanks for the answer! Ah, so it's a set of parameters you want to test, right? You have predefined them; it doesn't come from cross_val_score(). It's like your own grid search – thenac Dec 17 '20 at 11:56
  • Yes exactly! It can definitely be a grid_parameters like : `param_grid = [ {'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}, ]` – Yohann L. Dec 17 '20 at 14:20
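
For readers following this thread: a minimal sketch of the grid-search idea discussed in the comments above, using GridSearchCV instead of the manual `for param in my_param` loop. The SVC, its param_grid and the iris data are illustrative assumptions, not part of the original answer.

# Sketch: GridSearchCV runs an inner cross-validation for every parameter
# combination and keeps the best one, replacing the manual loop above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
# GridSearchCV refits the best estimator on the whole training set by default,
# so it can be evaluated directly on the held-out test set.
print('Test accuracy:', search.score(X_test, y_test))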

Short answer: NO

Long answer: if you want to use K-fold cross-validation, you do not usually split initially into train/test.

There are a lot of ways to evaluate a model. The simplest one is to use a train/test split, fit the model on the train set and evaluate it on the test set.

If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.


It's up to you what to choose, but I would go with K-fold or LOOCV.

The K-fold procedure is summarised in the figure below (for K=5):

[Figure: K-fold cross-validation procedure for K=5]
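
As a rough illustration of those two options, here is a minimal sketch (the classifier and dataset below are stand-ins, not from the original answer):

# Sketch: evaluating a model directly with cross-validation, without an
# initial train/test split - 5-fold CV and leave-one-out (LOOCV).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset
clf = DecisionTreeClassifier(random_state=0)

# K-fold: K fits, each one evaluated on the held-out fold
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print('5-fold accuracy:', kfold_scores.mean())

# LOOCV: one sample held out per iteration (as many fits as samples)
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print('LOOCV accuracy:', loo_scores.mean())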

seralouk
  • Thank you for your reply! Yes, I understand the concept of K-folds, but as I was looking into it I stumbled upon this image, which confused me about the final evaluation test data. I suppose that it's just a combination of the K-folds (the middle part of the image) and the simple train-test split (for the test data in yellow at the bottom) (https://scikit-learn.org/stable/_images/grid_search_cross_validation.png) – InNeedOfaName Nov 12 '19 at 16:26