
I am following along with the book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron (link). In chapter 2 you get hands-on experience actually building an ML system, using the California Housing Prices dataset from StatLib (link).

I have been running cross-validation tests using BOTH GridSearchCV and RandomizedSearchCV to see which performs better (they both perform about the same; depending on the run, GridSearchCV will beat RandomizedSearchCV and vice versa). During cross-validation on the training set, all of my RMSEs come back (after about 10 folds) looking like so:

49871.10156541779 {'max_features': 6, 'n_estimators': 100} GRID SEARCH CV
49573.67188289324 {'max_features': 6, 'n_estimators': 300} GRID SEARCH CV
49759.116323927 {'max_features': 8, 'n_estimators': 100} GRID SEARCH CV
49388.93702859155 {'max_features': 8, 'n_estimators': 300} GRID SEARCH CV
49759.445071611895 {'max_features': 10, 'n_estimators': 100} GRID SEARCH CV
49517.74394767381 {'max_features': 10, 'n_estimators': 300} GRID SEARCH CV
49796.22587441326 {'max_features': 12, 'n_estimators': 100} GRID SEARCH CV
49616.61833604992 {'max_features': 12, 'n_estimators': 300} GRID SEARCH CV
49795.571075148444 {'max_features': 14, 'n_estimators': 300} GRID SEARCH CV
49790.38581725693 {'n_estimators': 100, 'max_features': 12} RANDOM SEARCH CV
49462.758078362356 {'n_estimators': 300, 'max_features': 8} RANDOM SEARCH CV

Please note that I am presenting only the best of the roughly 50 results here. I am using the following code to generate them:

param_grid = [{'n_estimators' : [3, 10, 30, 100, 300],
               'max_features' : [2, 4, 6, 8, 10, 12, 14]},
              {'bootstrap' : [False], 'n_estimators' : [3, 10, 12],
               'max_features' : [2, 3, 4]}]

# Base estimator for the search; n_estimators and max_features here are overridden
# by param_grid (note the ** to unpack the dict into keyword arguments).
forest_regressor = RandomForestRegressor(**{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse',
                                            'max_depth': None, 'max_features': 8, 'max_leaf_nodes': None,
                                            'max_samples': None, 'min_impurity_decrease': 0.0,
                                            'min_impurity_split': None, 'min_samples_leaf': 1,
                                            'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0,
                                            'n_estimators': 300, 'n_jobs': None, 'oob_score': False,
                                            'random_state': None, 'verbose': 0, 'warm_start': False})

grid_search = GridSearchCV(forest_regressor, param_grid, cv=10, scoring="neg_mean_squared_error",
                           return_train_score=True, refit=True)

grid_search.fit(TrainingSet, TrainingLabels)
prediction = grid_search.predict(TrainingSet)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params, "GRID SEARCH CV")
##################################################################################
#Randomized Search Cross Validation

param_grid = [{'n_estimators' : [3, 10, 30, 100, 300],
               'max_features' : [2, 4, 6, 8, 10, 12, 14]},
              {'bootstrap' : [False], 'n_estimators' : [3, 10, 12],
               'max_features' : [2, 3, 4]}]

# Same base estimator as above; the searched parameters are overridden by param_grid
# (again using ** to unpack the dict into keyword arguments).
forest_regressor = RandomForestRegressor(**{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse',
                                            'max_depth': None, 'max_features': 8, 'max_leaf_nodes': None,
                                            'max_samples': None, 'min_impurity_decrease': 0.0,
                                            'min_impurity_split': None, 'min_samples_leaf': 1,
                                            'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0,
                                            'n_estimators': 300, 'n_jobs': None, 'oob_score': False,
                                            'random_state': None, 'verbose': 0, 'warm_start': False})

rand_search = RandomizedSearchCV(forest_regressor, param_grid, cv=10, refit=True,
                            scoring='neg_mean_squared_error', return_train_score=True)
rand_search.fit(TrainingSet, TrainingLabels)
prediction = rand_search.predict(TrainingSet)
cvres = rand_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params, "RANDOM SEARCH CV")

Now, I am doing things a little differently from what the book shows; my pipeline looks like this:

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

class Dataframe_Manipulation:
    def __init__(self):
        self.dataframe = pd.read_csv(r'C:\Users\bohayes\AppData\Local\Programs\Python\Python38\Excel and Text\housing.csv')
    def Cat_Creation(self):
        # Creation of an Income Category to organize the median incomes into strata (bins) to sample from
        self.income_cat = self.dataframe['income_category'] = pd.cut(self.dataframe['median_income'],
                                      bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                      labels=[1, 2, 3, 4, 5])
        self.rooms_per_house_cat = self.dataframe['rooms_per_house'] = self.dataframe['total_rooms']/self.dataframe['households']
        self.bedrooms_per_room_cat = self.dataframe['bedrooms_per_room'] = self.dataframe['total_bedrooms']/self.dataframe['total_rooms']
        self.pop_per_house = self.dataframe['pop_per_house'] = self.dataframe['population'] / self.dataframe['households']
        return self.dataframe
    def Fill_NA(self):
        self.imputer = KNNImputer(n_neighbors=5, weights='uniform')
        self.dataframe['total_bedrooms'] = self.imputer.fit_transform(self.dataframe[['total_bedrooms']])
        self.dataframe['bedrooms_per_room'] = self.imputer.fit_transform(self.dataframe[['bedrooms_per_room']])
        return self.dataframe
    def Income_Cat_Split(self):
        self.inc_cat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
        for self.train_index, self.test_index in self.inc_cat_split.split(self.dataframe, self.dataframe['income_category']):
            self.strat_train_set = self.dataframe.loc[self.train_index].reset_index(drop=True)
            self.strat_test_set = self.dataframe.loc[self.test_index].reset_index(drop=True)
            # the proportion is the % of total instances and which strata they are assigned to
            self.proportions = self.strat_test_set['income_category'].value_counts() / len(self.strat_test_set)
            # Only pulling out training set!!!!!!!!!!!!!!!
            return self.strat_train_set, self.strat_test_set
    def Remove_Cats_Test(self):
        self.test_labels = self.strat_test_set['median_house_value'].copy()
        self.strat_test_set = self.strat_test_set.drop(['median_house_value'], axis=1)
        return self.test_labels
    def Remove_Cats_Training(self):
        self.training_labels = self.strat_train_set['median_house_value'].copy()
        self.strat_train_set = self.strat_train_set.drop(['median_house_value'], axis=1)
        return self.training_labels
    def Encode_Transform(self):
        self.column_trans = make_column_transformer((OneHotEncoder(), ['ocean_proximity']), remainder='passthrough')
        self.training_set_encoded = self.column_trans.fit_transform(self.strat_train_set)
        self.test_set_encoded = self.column_trans.fit_transform(self.strat_test_set)
        return self.training_set_encoded, self.test_set_encoded
    def Standard_Scaler(self):
        self.scaler = StandardScaler()
        self.scale_training_set = self.scaler.fit(self.training_set_encoded)
        self.scale_test_set = self.scaler.fit(self.test_set_encoded)
        self.scaled_training_set = self.scaler.transform(self.training_set_encoded)
        self.scaled_test_set = self.scaler.transform(self.test_set_encoded)
        return self.scaled_training_set
    def Test_Set(self):
        return self.scaled_test_set
    
A = Dataframe_Manipulation()
B = A.Cat_Creation()
C = A.Fill_NA()
D = A.Income_Cat_Split()
TestLabels = A.Remove_Cats_Test()
TrainingLabels = A.Remove_Cats_Training()
G = A.Encode_Transform()
TrainingSet = A.Standard_Scaler()
TestSet = A.Test_Set()

The Grid and Random Searches come after this bit. However, my RMSE scores come back drastically different when I evaluate on the TestSet, which leads me to believe that I am overfitting; or maybe the RMSEs look different because I am using a smaller test set? Here you go:

19366.910530221918
19969.043158986697

Here is the code that generates those numbers; it runs after the Grid and Random Searches and evaluates the best estimators on the test set and test labels:

#Final Grid Model
final_grid_model = grid_search.best_estimator_

final_grid_prediction = final_grid_model.predict(TestSet)
final_grid_mse = mean_squared_error(TestLabels, final_grid_prediction)
final_grid_rmse = np.sqrt(final_grid_mse)
print(final_grid_rmse)
###################################################################################
#Final Random Model
final_rand_model = rand_search.best_estimator_

final_rand_prediction = final_rand_model.predict(TestSet)
final_rand_mse = mean_squared_error(TestLabels, final_rand_prediction)
final_rand_rmse = np.sqrt(final_rand_mse)
print(final_rand_rmse)

Just to be sure, I also computed a 95% confidence interval for the test RMSE of each model; here are the code and results:

#Confidence Grid Search 
confidence = 0.95
squared_errors = (final_grid_prediction - TestLabels) ** 2
print(np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors))))
###################################################################################
#Confidence Random Search 
confidence1 = 0.95
squared_errors1 = (final_rand_prediction - TestLabels) ** 2
print(np.sqrt(stats.t.interval(confidence1, len(squared_errors1) - 1,
                         loc=squared_errors1.mean(),
                         scale=stats.sem(squared_errors1))))
                         

[18643.4914044  20064.26363526]
[19222.30464011 20688.84660134]

Why is my average RMSE on the TrainingSet about 49,000, while the same metric on the test set averages about 19,000? I must be overfitting, but I am not sure how or where I am going wrong.

  • Please do not format your Python code as Javascript snippets (edited). – desertnaut Dec 11 '20 at 22:55
  • Your test RMSE is *lower* than your training one, i.e. your model actually performs better on the *test* set than on the training data; this cannot be overfitting by definition. – desertnaut Dec 11 '20 at 23:01
  • Hey - Thank you for getting back to me, I am new to Stack Overflow and I could not really figure out how to format my code, sorry about that. Additionally, I am just concerned because my RMSE is roughly 30,000 less on my test than on my training. Versus, when I read the book, their test set score is almost identical. If you have the time, would you be able to briefly explain perhaps why it is improving so drastically? Could it be underfitting? – LedZeppelin1969 Dec 14 '20 at 15:30
  • Underfitting does not explain anything here; my 2 cents: instead of such "exotic" explanations (overfitting/underfitting), most probably your (unnecessarily convoluted) code does not do what it should do (i.e. you have coding issues). – desertnaut Dec 14 '20 at 17:03

1 Answer


tl;dr: Your code is unnecessarily convoluted for such a (standard) job; do not re-invent the wheel, go with a pipeline instead.


There is an error in how you scale your data, which most probably is the root cause of the observed behavior here; in the second of these two lines:

    self.scale_training_set = self.scaler.fit(self.training_set_encoded)
    self.scale_test_set = self.scaler.fit(self.test_set_encoded)

you essentially overwrite your scaler with the results on the test set fit, and subsequently you actually scale your training data with this test-fitted scaler:

    self.scaled_training_set = self.scaler.transform(self.training_set_encoded)

Since your test set is only 20% of the dataset, it does not contain enough values to adequately cover the whole range (min-max) of the (bigger) training set; as a result, the training set is mis-scaled (it actually contains values well above the max of the scaled test set), which probably leads to a higher RMSE (which is not scale-invariant and by definition depends on the scale of the predictions).
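
Here is a minimal sketch of the effect (made-up numbers, not your housing data): fitting the scaler a second time on a small, narrow-range set and then transforming the wider training set produces values far outside the usual standardized range.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    train = np.array([[1.0], [5.0], [10.0], [50.0], [100.0]])   # wide range
    test = np.array([[2.0], [4.0], [6.0]])                      # narrow range

    scaler = StandardScaler()
    scaler.fit(train)   # fit on the training data...
    scaler.fit(test)    # ...immediately overwritten by the test-set fit

    # training data transformed with test-set statistics -> badly mis-scaled
    print(scaler.transform(train).max())   # ~59 here, instead of the ~1.8 a proper training-set fit would give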

You may think that using StratifiedShuffleSplit upstream should have protected you from such a case, but truth is that StratifiedShuffleSplit is only good for classification datasets, and it is actually meaningless in regression ones (I am genuinely surprised that it does not throw an error here).

To remedy this issue, you should just remove the line

    self.scale_test_set = self.scaler.fit(self.test_set_encoded)

from your Standard_Scaler() function.
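
In other words, the method would end up as something like this (a sketch based on your class: fit the scaler on the training data only, then reuse it to transform both sets):

    def Standard_Scaler(self):
        self.scaler = StandardScaler()
        # fit ONLY on the (encoded) training data...
        self.scaler.fit(self.training_set_encoded)
        # ...and use the same fitted scaler to transform both sets
        self.scaled_training_set = self.scaler.transform(self.training_set_encoded)
        self.scaled_test_set = self.scaler.transform(self.test_set_encoded)
        return self.scaled_training_set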

Keep in mind that, in general, we never fit on a test set - we only transform; scikit-learn pipelines, apart from saving you from writing all this boilerplate code (which increases the probability of coding errors), will protect you from exactly this kind of error...
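
For illustration, here is a rough sketch of what such a pipeline could look like for this dataset (column names taken from housing.csv; the imputer/encoder/scaler choices mirror yours, and the hyperparameter values are just examples):

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    num_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income']
    cat_cols = ['ocean_proximity']

    # numeric columns: impute then scale; categorical column: one-hot encode
    preprocess = ColumnTransformer([
        ('num', Pipeline([('impute', KNNImputer(n_neighbors=5)),
                          ('scale', StandardScaler())]), num_cols),
        ('cat', OneHotEncoder(), cat_cols),
    ])

    model = Pipeline([
        ('prep', preprocess),
        ('forest', RandomForestRegressor()),
    ])

    param_grid = {'forest__n_estimators': [100, 300],
                  'forest__max_features': [6, 8]}

    grid_search = GridSearchCV(model, param_grid, cv=10,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)

    # strat_train_set / training_labels are your existing (unscaled) training split;
    # inside every CV fold the imputer and scaler are fitted on that fold's training
    # part only, so no statistics ever leak from held-out data.
    # grid_search.fit(strat_train_set, training_labels)

This way all the fit/transform bookkeeping happens inside the pipeline, and the same fitted pipeline is later applied as-is to the test set.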

  • Thank you very much for the detailed response! I have followed your advice and I removed the line in question. However this actually causes my RMSE score to be even better on average than before. I was expecting to have my RMSE drop down to the 40,000's at that point. Would you be able to explain more about why I wouldn't want to "fit" the test set? Only if you have the time. Also, when fitting on StandardScaler, what is the difference between fitting and transforming? – LedZeppelin1969 Dec 15 '20 at 15:52
  • Ultimately, I am very new to ML, and I am not trying to reinvent anything, but I am testing my ability to independently come up with code that can perform the same operations without copying it. I will check out the pipeline link. Thank you very much for all your help. – LedZeppelin1969 Dec 15 '20 at 15:53
  • in short, see [What's the difference between fit and fit_transform in scikit-learn models?](https://datascience.stackexchange.com/questions/12321/whats-the-difference-between-fit-and-fit-transform-in-scikit-learn-models) and [what is the difference between 'transform' and 'fit_transform' in sklearn](https://stackoverflow.com/q/23838056/4685471) (hold for everything equipped with `fit` and `transform` methods). And go with a pipeline, as already suggested. – desertnaut Dec 15 '20 at 16:23
  • Hey, so I have had an interesting update: I changed 'self.scaled_training_set = self.scaler.fit(self.training_set_encoded)' to 'self.scaled_training_set = self.scaler.fit_transform(self.training_set_encoded)', and now my RMSEs on the training set for the Grid and Random Search CVs are 18253.66378264979 and 18556.824376774486 respectively. Do you think this would have affected it? I am no longer fitting and then transforming. Let me know what you think. – LedZeppelin1969 Dec 23 '20 at 00:57