How to split data into test and train after applying stratified k-fold cross validation?

Question

I have already assigned each row to a specific fold using the following code:

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified K-fold cross-validation 
df['kfold'] = -1
df = df.sample(frac=1).reset_index(drop=True)
y = df.quality
kf = StratifiedKFold(n_splits=5)

for f, (t_,v_) in enumerate(kf.split(X=df, y=y)):
  df.loc[v_, 'kfold'] = f

Now the dataframe is as expected:


      alcohol  volatile acidity  sulphates  citric acid  quality  kfold
1499     10.9              0.36       0.73         0.39        6      4
1500      9.5              0.65       0.55         0.10        5      4
1501     13.4              0.44       0.66         0.68        6      4
1502      9.6              0.59       0.67         0.24        5      4
1503     13.0              0.53       0.77         0.79        5      4

But how do I split it into train and test sets?

score 1 · Accepted Answer · answered Aug 18 '20 at 02:56

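StratifiedKFold splits the dataframe into n_splits folds, keeping the class proportions of y roughly the same in each one, and yields the training/test indices for every split. In each iteration, one fold (roughly len(data)/n_splits rows) is held out for testing and the remaining folds are used for training.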
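To make the sizes concrete, here is a minimal sketch on made-up data (the arrays X and y and the name fold_sizes are assumptions for illustration, not part of the question). It only prints how many rows end up in the training and test part of each split, along with the class counts in the test part.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 samples, two classes in a 3:1 ratio.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

kf = StratifiedKFold(n_splits=5)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    # Each sample appears in exactly one test part; the rest form the training part.
    fold_sizes.append((len(train_idx), len(test_idx)))
    # Class counts in the test part should mirror the overall 3:1 ratio.
    print(fold, len(train_idx), len(test_idx), np.bincount(y[test_idx]))
```

With this toy data every split should hold out 4 rows for testing and keep 16 for training.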

In your for loop, you can access the train and test sets as follows:

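for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
  # split() yields positional indices, so iloc works even if the index was not reset
  df_train = df.iloc[t_]  # training rows for fold f
  df_test = df.iloc[v_]   # held-out (test) rows for fold f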

The kfold column you added records the fold in which each row is used for testing. For a given fold, the rows carrying that label are the test set and all other rows are the training set. For example, for fold 1 the test data is kfold == 1 and the training data is kfold != 1.
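If you would rather split on the stored kfold column than re-run the loop, something along these lines should work. This is only a sketch: the small dataframe stands in for the wine data, and get_fold is a hypothetical helper name, not a scikit-learn or pandas function.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the wine data: one feature plus the "quality" label.
df = pd.DataFrame({
    "alcohol": [9.5 + 0.1 * i for i in range(20)],
    "quality": [5] * 10 + [6] * 10,
})

# Same fold assignment as in the question (random_state added for repeatability).
df["kfold"] = -1
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
kf = StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=df.quality)):
    df.loc[v_, "kfold"] = f


def get_fold(data, fold):
    """Return (train, test) dataframes for one fold, based on the kfold column."""
    train = data[data.kfold != fold].reset_index(drop=True)
    test = data[data.kfold == fold].reset_index(drop=True)
    return train, test


df_train, df_test = get_fold(df, 1)

# Separate features from the target; kfold is bookkeeping, not a feature.
X_train, y_train = df_train.drop(columns=["quality", "kfold"]), df_train.quality
X_test, y_test = df_test.drop(columns=["quality", "kfold"]), df_test.quality
```

Calling get_fold(df, k) for k in range(5) gives the five train/test pairs one at a time, which is convenient when each fold is trained in a separate run.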
