I am a novice in machine learning and I am confused about K-fold cross-validation. When I write a for loop over the folds, where exactly should I define the sklearn model (not a PyTorch model)? I have seen tutorials where the model is defined inside the fold loop and used to predict on X_validation. But then we are defining k different models inside that loop, and the final model is the one trained only on the last fold; it has no link to any of the previous folds.
- In my opinion we should define the scikit-learn model outside the K-fold loop. Please explain whether I am thinking right, or whether there is any data-leakage problem associated with that approach.
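To make the pattern I am asking about concrete, here is a minimal, self-contained sketch of the "fresh model inside each fold" approach on synthetic data (the dataset, scaler, and column-free arrays here are stand-ins, not my real project code). Fitting the preprocessing on the training split only is what avoids leaking validation data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
oof_pred = np.zeros(len(y))  # out-of-fold predictions, one per sample

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # a fresh scaler and model per fold, fit on the train split only
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    # predict on the held-out split; every sample gets exactly one OOF prediction
    oof_pred[val_idx] = clf.predict_proba(scaler.transform(X[val_idx]))[:, 1]

# a single cross-validated AUC estimate built from all k fold models
print(roc_auc_score(y, oof_pred))
```

Here no single model "survives" the loop; the k models together just produce one out-of-fold estimate of generalization.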
Below is the implementation I am using in my project; here I have defined the sklearn model inside the K-fold loop.
import pandas as pd
from sklearn import linear_model
from sklearn import metrics

import config
from lr_pipeline import lr_pipe


def run_training(fold):
    # load the full training data, then split out this fold
    df = pd.read_csv(config.TRAIN_FOLDS)
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_val = df[df.kfold == fold].reset_index(drop=True)

    # get X_train, X_val, y_train and y_val
    X_train = df_train.drop(['id', 'target_class', 'kfold'], axis=1)
    y_train = df_train['target_class']
    X_val = df_val.drop(['id', 'target_class', 'kfold'], axis=1)
    y_val = df_val['target_class']

    # preprocessing pipeline: fit on the train split, transform both
    X_train = lr_pipe.fit_transform(X_train)
    X_val = lr_pipe.transform(X_val)

    # train clf
    clf = linear_model.LogisticRegression()
    clf.fit(X_train, y_train)

    # metric
    pred = clf.predict_proba(X_val)[:, 1]
    auc = metrics.roc_auc_score(y_val, pred)
    print(f"fold={fold}, auc={auc}")

    df_val.loc[:, "lr_pred"] = pred
    return df_val[["id", "kfold", "target_class", "lr_pred"]]


if __name__ == '__main__':
    dfs = []
    for i in range(5):
        temp_df = run_training(i)
        dfs.append(temp_df)
    fin_valid_df = pd.concat(dfs)
    print(fin_valid_df.shape)
    fin_valid_df.to_csv(config.LR_MODEL_PRED, index=False)
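If a single deployable model is wanted after cross-validation, my understanding (please correct me if wrong) is that the per-fold models only estimate generalization, and a fresh pipeline is then refit once on all the training data. A sketch of that, using synthetic data in place of my config paths and lr_pipe:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# cross_val_score fits a clone of the pipeline inside each fold,
# so the per-fold models never leak into one another
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())

# the final model is refit once on ALL the training data
final_model = pipe.fit(X, y)
```

So the answer would not be "define the model outside the loop"; it would be "define it inside the loop for evaluation, then refit from scratch on everything for the final model".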