I'm having trouble implementing cross-validation. I understand that after cross-validation I have to re-train the model, but I have the following doubts:
- Do the train/test split before cross-validation, use X_train and y_train for the cross-validation process, and then re-train the model with X_train and y_train?
- Or split the data into features (X) and labels (y), use those variables in the cross-validation process, and then do the train/test split and train the model with X_train and y_train?
- If I use the features and labels variables, what is the next step after cross-validation?
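To make the first option concrete, here is a minimal sketch of what I think that workflow would look like (using random data as a stand-in for the Pima CSV, so it runs on its own):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Pima features/labels
rng = np.random.RandomState(0)
X = rng.rand(200, 8)
y = rng.randint(0, 2, 200)

# 1) Hold out a test set first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=12)

# 2) Cross-validate on the training portion only
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print(scores.mean())

# 3) Re-fit on the full training set, evaluate once on the held-out test set
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Is this the right ordering, or should the cross-validation see all of the data?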
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()
# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']
from sklearn.model_selection import train_test_split
test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size,
random_state=seed)
First block:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, Y_train, cv=kfold)
model.fit(X_train, Y_train)
Second block:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, features, labels, cv=kfold)
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2,
                                                    random_state=42)
model.fit(X_train, Y_train)
Which of the two code blocks is correct, or is there another way to implement cross-validation correctly?