
From what I have read in several sources, we should retrain the model on the whole dataset once we are satisfied with the cross-validation (CV) results.

Here is the code I use to train a Random Forest with 5-fold CV:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

n_splits = 5
kfold = KFold(n_splits=n_splits)
classifier_RF = RandomForestClassifier(n_estimators=100,
                                       criterion='entropy',
                                       min_samples_split=2,
                                       min_samples_leaf=1,
                                       random_state=1)

# Evaluate the model on each fold; fit() retrains from scratch every time.
for i, (train_index, val_index) in enumerate(kfold.split(x_train, y_train)):
    print('Fold:', i)
    x_train_fold, x_val_fold = x_train[train_index], x_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
    classifier_RF.fit(x_train_fold, y_train_fold)
    y_pred_fold = classifier_RF.predict(x_val_fold)
    print(classification_report(y_val_fold, y_pred_fold))
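
Incidentally, an equivalent and more compact way to get per-fold scores (a sketch, assuming x_train and y_train are the same arrays as above) is cross_val_score, which clones the estimator internally for each fold:

from sklearn.model_selection import cross_val_score

# cross_val_score fits a clone of classifier_RF on each fold, so the
# original estimator is left untouched; scoring='accuracy' is one
# possible choice of metric.
scores = cross_val_score(classifier_RF, x_train, y_train,
                         cv=kfold, scoring='accuracy')
print('Mean accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))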

Do I need to create a new RandomForestClassifier to retrain on the whole dataset, or can I just reuse the classifier_RF instance?

Should I do this:

new_classifier_RF = RandomForestClassifier(n_estimators=100,
                                           criterion='entropy',
                                           min_samples_split=2,
                                           min_samples_leaf=1,
                                           random_state=1)

new_classifier_RF.fit(x_train, y_train)
# Store predictions separately so we don't overwrite y_test
# (the true labels saved earlier by train_test_split).
y_pred = new_classifier_RF.predict(x_test)

or should I do this:

classifier_RF.fit(x_train, y_train)  # reuse the estimator from the CV loop
y_pred = classifier_RF.predict(x_test)

And why? Does setting random_state to a fixed integer change anything in either (or both) of these approaches?
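
To make the question concrete, here is the check I have in mind (a sketch; fresh_RF is just an illustrative name, and I am assuming that calling fit() discards whatever the estimator learned before):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Option 2: refit the estimator that was used during CV on the full
# training set.
classifier_RF.fit(x_train, y_train)
pred_reused = classifier_RF.predict(x_test)

# Option 1: fit a brand-new estimator with identical hyperparameters.
fresh_RF = RandomForestClassifier(n_estimators=100,
                                  criterion='entropy',
                                  min_samples_split=2,
                                  min_samples_leaf=1,
                                  random_state=1)
fresh_RF.fit(x_train, y_train)
pred_fresh = fresh_RF.predict(x_test)

# With the same random_state, data, and hyperparameters, I would expect
# both fits to produce identical forests, so this should print True.
print(np.array_equal(pred_reused, pred_fresh))

Is that expectation correct?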

Murilo
