
I am a novice in machine learning and I am confused about K-fold cross-validation. When I write the fold for-loop, where exactly should I define the sklearn model (not a PyTorch model)? I have seen tutorials where the model is defined inside the fold for-loop and used to predict on X_validation. But then we are defining k different models inside that loop, and the final model is the one trained on only the last fold; it has no link to any of the previous folds.

  • In my opinion we should define the scikit-learn model outside the K-fold cross-validation loop. Please explain whether my thinking is right, or whether there is any data leakage problem associated with that approach.

Below is the implementation I am using in my project. Here I have defined the sklearn model inside the K-fold for-loop.

import pandas as pd
from sklearn import linear_model
from sklearn import metrics

import config 
from lr_pipeline import lr_pipe

def run_training(fold):
    # load the training data with fold assignments and split off this fold
    df = pd.read_csv(config.TRAIN_FOLDS)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_val = df[df.kfold == fold].reset_index(drop=True)

    # get X_train, X_val, y_train and y_val
    X_train = df_train.drop(['id','target_class','kfold'],axis=1)
    y_train = df_train['target_class']

    X_val = df_val.drop(['id','target_class','kfold'],axis=1)
    y_val = df_val['target_class']

    # preprocessing pipeline here
    X_train = lr_pipe.fit_transform(X_train)
    X_val = lr_pipe.transform(X_val)

    # train clf
    clf = linear_model.LogisticRegression()
    clf.fit(X_train,y_train)

    # metric
    pred = clf.predict_proba(X_val)[:,1]
    auc = metrics.roc_auc_score(y_val,pred)
    print(f"fold={fold}, auc={auc}")

    df_val.loc[:,"lr_pred"] = pred
    return df_val[["id","kfold","target_class","lr_pred"]]

if __name__ == '__main__':
    dfs = []
    for i in range(5):
        temp_df = run_training(i)
        dfs.append(temp_df)
    fin_valid_df = pd.concat(dfs)


    print(fin_valid_df.shape)
    fin_valid_df.to_csv(config.LR_MODEL_PRED,index=False)
Comsavvy

1 Answer


Let me start with a short background. Most machine learning models have two sets of parameters:

First, there are the so-called hyper-parameters, e.g. for regularized linear regression (Ridge/Lasso) the regularization coefficient alpha, for a decision tree the depth of the tree, for K-nearest neighbors (KNN) the number of neighbors, and so on.

Second, there are the parameters of the model, e.g. for linear regression the weights (w and b in Xw + b), for a decision tree the specific split at each node, while KNN is a rare case of a model with no learned parameters (it simply stores the training data).
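
To make the distinction concrete, here is a minimal sketch (the synthetic dataset is purely illustrative): in scikit-learn's LogisticRegression, C is a hyper-parameter set by the user, while coef_ and intercept_ are parameters estimated by fit.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# hyper-parameter: chosen by the user before training
clf = LogisticRegression(C=0.1)

# parameters: estimated from the data by the learning algorithm
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # the learned w and b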

Model parameters are estimated by a learning algorithm (this is what happens when you call model.fit(X, y)); hyper-parameters are not. Hyper-parameters are set by the user. The question is how to choose them, and the answer is cross-validation. It could be k-fold cross-validation or any other validation technique, such as a simple train/test split or repeated random shuffling.
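
For example, here is a minimal sketch of choosing a hyper-parameter with k-fold cross-validation (the candidate values of C and the synthetic data are just illustrations):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# evaluate each candidate hyper-parameter value with 5-fold cross-validation
for C in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(LogisticRegression(C=C), X, y, cv=5, scoring="roc_auc")
    print(f"C={C}: mean AUC = {np.mean(scores):.3f}")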

So, regarding your original question: while the hyper-parameters of two models trained on different folds may be the same, their parameters will be different, since parameters are derived by the learning algorithm from the training data. Thus, it does not really matter whether you create your model inside the for-loop or outside; models trained on different folds will be independent and different. But the goal of cross-validation is not to train a final model; it is to choose the best set of hyper-parameters and to estimate generalization performance.
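
To illustrate the point, here is a minimal sketch on synthetic data (not your pipeline): whether you instantiate the estimator inside the loop or define it once outside and clone it per fold, each fold is fitted from scratch on its own training data only, so the per-fold results are the same and nothing leaks between folds.

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
base_clf = LogisticRegression()  # defined once, outside the loop

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    clf = clone(base_clf)  # fresh, unfitted copy with the same hyper-parameters
    clf.fit(X[trn_idx], y[trn_idx])
    pred = clf.predict_proba(X[val_idx])[:, 1]
    print(f"fold={fold}, auc={roc_auc_score(y[val_idx], pred):.3f}")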

However, more and more often I see people average models trained on different folds without retraining a model on the whole training set, thus decreasing the variance of the predictions. In that case it can be useful to define the model inside the for-loop and save each instance to use later, like this:

def run_training(fold):
    ...
    # fit a fresh classifier on this fold's training data and return it
    clf = linear_model.LogisticRegression()
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_val)[:, 1]
    return clf

I recommend sticking to this approach and doing something like this:

clfs = []
aucs_train = []
aucs_val = []
# kf is assumed to be a KFold (or StratifiedKFold) instance
for train, val in kf.split(X, y):
    clf = run_training(train)  # fits a fresh model on X[train], y[train]
    clfs.append(clf)
    # predict_proba (not predict) gives the class-1 probabilities needed for ROC AUC
    y_pred_train = clf.predict_proba(X[train])[:, 1]
    y_pred_val = clf.predict_proba(X[val])[:, 1]
    aucs_train.append(metrics.roc_auc_score(y[train], y_pred_train))
    aucs_val.append(metrics.roc_auc_score(y[val], y_pred_val))

In that case, clfs will contain classifiers trained on different folds, and you can use them both for validation and for predicting on a hold-out/test set.
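
For example, assuming clfs has been filled as above and X_test is a hold-out feature matrix preprocessed the same way as the training data, a minimal sketch of averaging the fold models' predictions would be:

import numpy as np

# average the class-1 probabilities of the k fold models on the hold-out set
test_pred = np.mean([clf.predict_proba(X_test)[:, 1] for clf in clfs], axis=0)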

Anvar Kurmukov