So, as can be seen here, here, and here, we should retrain our model on the whole dataset once we are satisfied with our CV results.
Here is the code I use to train a Random Forest with K-fold cross-validation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

n_splits = 5
kfold = KFold(n_splits=n_splits)

classifier_RF = RandomForestClassifier(n_estimators=100,
                                       criterion='entropy',
                                       min_samples_split=2,
                                       min_samples_leaf=1,
                                       random_state=1)

for i, (train_index, val_index) in enumerate(kfold.split(x_train, y_train)):
    print('Fold:', i)
    # fit on the training split of this fold, evaluate on its validation split
    x_train_fold, x_val_fold = x_train[train_index], x_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
    classifier_RF.fit(x_train_fold, y_train_fold)
    y_pred_fold = classifier_RF.predict(x_val_fold)
    print(classification_report(y_val_fold, y_pred_fold))
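(For context, I believe the loop above is roughly equivalent to the sketch below, assuming cross_val_score with its default scoring is an acceptable stand-in for the per-fold classification_report; please correct me if not:)

from sklearn.model_selection import cross_val_score

# cross_val_score clones the estimator for every fold, so classifier_RF
# itself is left untouched by the evaluation
scores = cross_val_score(classifier_RF, x_train, y_train, cv=kfold)
print(scores.mean(), scores.std())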
Do I need to create a new RandomForestClassifier to retrain on the whole dataset, or can I just reuse the classifier_RF one?

Should I do this:
new_classifier_RF = RandomForestClassifier(n_estimators=100,
                                           criterion='entropy',
                                           min_samples_split=2,
                                           min_samples_leaf=1,
                                           random_state=1)
new_classifier_RF.fit(x_train, y_train)
y_pred = new_classifier_RF.predict(x_test)  # y_test was saved earlier by train_test_split
Or should I do this:
classifier_RF.fit(x_train, y_train)
y_pred = classifier_RF.predict(x_test)  # y_test was saved earlier by train_test_split
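(For completeness, a third variant I considered, assuming sklearn.base.clone gives a fresh, unfitted copy with the same hyperparameters; final_RF is just a name I made up here:)

from sklearn.base import clone

# clone() copies the hyperparameters (including random_state=1) but not
# any fitted state, so this should behave like creating a new classifier
final_RF = clone(classifier_RF)
final_RF.fit(x_train, y_train)
y_pred = final_RF.predict(x_test)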
And why? Does setting random_state to a fixed integer change anything in either (or both) of these approaches?
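(If it helps clarify the random_state part of my question, this is the kind of check I have in mind; rf_a and rf_b are hypothetical names:)

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# with the same fixed random_state and the same training data, I expect
# two independently fitted forests to make identical predictions
rf_a = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_train, y_train)
rf_b = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_train, y_train)
print(np.array_equal(rf_a.predict(x_test), rf_b.predict(x_test)))  # expect: True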