So, as can be seen here, here, and here, we should retrain our model on the whole dataset once we are satisfied with our CV results.
Here is the code I use to train a Random Forest with K-fold cross-validation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

n_splits = 5
kfold = KFold(n_splits=n_splits)

classifier_RF = RandomForestClassifier(n_estimators=100,
                                       criterion='entropy',
                                       min_samples_split=2,
                                       min_samples_leaf=1,
                                       random_state=1)

for i, (train_index, val_index) in enumerate(kfold.split(x_train, y_train)):
    print('Fold:', i)
    # fit on the training split of this fold, evaluate on its validation split
    x_train_fold, x_val_fold = x_train[train_index], x_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
    classifier_RF.fit(x_train_fold, y_train_fold)
    y_pred_fold = classifier_RF.predict(x_val_fold)
    print(classification_report(y_val_fold, y_pred_fold))
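(For context, I believe the loop above is roughly equivalent to the sketch below, assuming cross_val_score with its default scoring is an acceptable stand-in for the per-fold classification_report; please correct me if not:)

from sklearn.model_selection import cross_val_score

# cross_val_score clones the estimator for every fold, so classifier_RF
# itself is left untouched by the evaluation
scores = cross_val_score(classifier_RF, x_train, y_train, cv=kfold)
print(scores.mean(), scores.std())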
Do I need to create a new RandomForestClassifier to retrain on the whole dataset, or can I just reuse the classifier_RF one?

Should I do this:
new_classifier_RF = RandomForestClassifier(n_estimators=100,
                                           criterion='entropy',
                                           min_samples_split=2,
                                           min_samples_leaf=1,
                                           random_state=1)
new_classifier_RF.fit(x_train, y_train)
y_pred = new_classifier_RF.predict(x_test)  # y_test was saved earlier by train_test_split
Or should I do this:
classifier_RF.fit(x_train, y_train)
y_pred = classifier_RF.predict(x_test)  # y_test was saved earlier by train_test_split
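(For completeness, a third variant I considered, assuming sklearn.base.clone gives a fresh, unfitted copy with the same hyperparameters; final_RF is just a name I made up here:)

from sklearn.base import clone

# clone() copies the hyperparameters (including random_state=1) but not
# any fitted state, so this should behave like creating a new classifier
final_RF = clone(classifier_RF)
final_RF.fit(x_train, y_train)
y_pred = final_RF.predict(x_test)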
And why? Does setting random_state to a fixed integer change anything in either (or both) of these approaches?
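(If it helps clarify the random_state part of my question, this is the kind of check I have in mind; rf_a and rf_b are hypothetical names:)

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# with the same fixed random_state and the same training data, I expect
# two independently fitted forests to make identical predictions
rf_a = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_train, y_train)
rf_b = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_train, y_train)
print(np.array_equal(rf_a.predict(x_test), rf_b.predict(x_test)))  # expect: True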