I am a beginner in ML and am trying to understand what exactly the advantage of (Stratified) KFold is over the classic train_test_split.
The classic train_test_split uses exactly one part for training (in this case 75%) and one part for testing (in this case 25%). Here I know exactly which data points are used for training and which for testing (see code).
When splitting with (Stratified) KFold, I use 4 splits, which gives 4 different training/test partitions. It is not clear to me which of the 4 partitions is used to train and test the Logistic Regression. Does it make sense to set up the split this way? As far as I understand, the advantage of (Stratified) KFold is that all of the data can be used for training. How would I have to change the code to achieve this?
Creating Data
import pandas as pd
import numpy as np
target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'col_a': np.random.random(25),
                   'target': target})
df
train_test_split
from sklearn.model_selection import train_test_split
X = df.col_a
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
print("TRAIN:", X_train.index, "TEST:", X_test.index)
Output:
TRAIN: Int64Index([1, 13, 8, 9, 21, 12, 10, 4, 20, 19, 7, 5, 15, 22, 24, 17, 11, 23], dtype='int64')
TEST: Int64Index([2, 6, 16, 0, 14, 3, 18], dtype='int64')
Stratified KFold
from sklearn.model_selection import StratifiedKFold
X = df.col_a
y = df.target
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
TRAIN: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 22 23 24] TEST: [ 0 1 2 3 4 20 21]
TRAIN: [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 23 24] TEST: [ 5 6 7 8 9 22]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 24] TEST: [10 11 12 13 14 23]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23] TEST: [15 16 17 18 19 24]
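To make the relationship between the 4 partitions concrete, here is a minimal sketch that collects the test indices of every fold and checks whether, taken together, they cover each of the 25 rows exactly once. The names `test_folds`, `all_test` and `covers_everything` are only illustrative, and the all-zero feature array is a placeholder, on the assumption that the stratified split depends only on the labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same label layout as above: 20 ones followed by 5 zeros
target = np.ones(25)
target[-5:] = 0

# Placeholder features; only the number of rows matters for splitting
X = np.zeros((25, 1))

skf = StratifiedKFold(n_splits=4)

# Keep only the test indices of each of the 4 folds
test_folds = [test_index for _, test_index in skf.split(X, target)]

# Concatenate and sort them to compare against the full index range
all_test = np.sort(np.concatenate(test_folds))
covers_everything = np.array_equal(all_test, np.arange(25))

print("Fold sizes:", [len(fold) for fold in test_folds])
print("Every row is tested exactly once:", covers_everything)
```

If the check holds, no single partition is "the" test set: each row is in the test part of one fold and in the training part of the other three.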
Using Logistic Regression
from sklearn.linear_model import LogisticRegression
# A pandas Series has no reshape(); convert to a NumPy array first
X_train = X_train.to_numpy().reshape(-1, 1)
X_test = X_test.to_numpy().reshape(-1, 1)
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.predict(X_test)
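As written, the code above only fits the Logistic Regression on whatever `X_train`/`X_test` were left over from the last loop iteration. One possible way to use all 4 partitions instead is sketched below: fit a fresh model inside each fold, collect the scores, and average them. This is an assumption about what is wanted, not the only option. The seeded generator, the 2-D `df[['col_a']]` selection, and the names `fold_scores`, `cv_scores` and `final_clf` are illustrative choices that differ from the snippets above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Recreate the toy data (seeded here only so the run is repeatable)
rng = np.random.default_rng(0)
target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'col_a': rng.random(25), 'target': target})

X = df[['col_a']]  # double brackets keep X two-dimensional
y = df.target

skf = StratifiedKFold(n_splits=4)

# Fit and score a new model on each of the 4 train/test partitions
fold_scores = []
for train_index, test_index in skf.split(X, y):
    clf = LogisticRegression()
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    fold_scores.append(clf.score(X.iloc[test_index], y.iloc[test_index]))

print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", np.mean(fold_scores))

# cross_val_score runs the same loop in a single call
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=skf)

# Once the estimate looks acceptable, a final model can be fit on all rows
final_clf = LogisticRegression().fit(X, y)
```

In this sketch the folds are used only to estimate performance; the model that would actually be kept is `final_clf`, which is trained on all 25 rows.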