Hyperparameter Tuning in Random forest

Question

I was trying Random Forest Algorithm on Boston dataset to predict the house prices medv with the help of sklearn's RandomForestRegressor.In all I tried 3 iterations as below

Iteration 1: Using the model with default hyperparameters

#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(random_state = 1, n_jobs = -1) 
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)

#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)


y_pred_train = RFReg.predict(X_train)

Results of Iteration 1

{'RMSE Test': 2.9850839211419435, 'RMSE Train': 1.2291604936401441}

Iteration 2: I used RandomizedSearchCV to get optimum values of hyper-parameters

from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1) 

param_grid = { 
    'max_features' : ["auto", "sqrt", "log2"],
    'min_samples_split' : np.linspace(0.1, 1.0, 10),
     'max_depth' : [x for x in range(1,20)]


from sklearn.model_selection import RandomizedSearchCV
CV_rfc = RandomizedSearchCV(estimator=RFReg, param_distributions =param_grid, n_jobs = -1, cv= 10, n_iter = 50)
CV_rfc.fit(X_train, y_train)

So I got the best hyperparameters as follows

CV_rfc.best_params_
#{'min_samples_split': 0.1, 'max_features': 'auto', 'max_depth': 18}
CV_rfc.best_score_
#0.8021713812777814

So I trained a new model with best hyperparameters as below

#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1, min_samples_split = 0.1, max_features = 'auto', max_depth = 18) 
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)

#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)


y_pred_train = RFReg.predict(X_train)

Results of Iteration 2

{'RMSE Test': 3.2836794902147926, 'RMSE Train': 2.71230367772569}

Iteration 3: I use GridSearchCV to get optimum values of hyper-parameters

from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1) 

param_grid = { 
    'max_features' : ["auto", "sqrt", "log2"],
    'min_samples_split' : np.linspace(0.1, 1.0, 10),
     'max_depth' : [x for x in range(1,20)]

}

from sklearn.model_selection import GridSearchCV
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10, n_jobs = -1)
CV_rfc.fit(X_train, y_train)

So I got the best hyperparameters as follows

CV_rfc.best_params_
#{'max_depth': 12, 'max_features': 'auto', 'min_samples_split': 0.1}
CV_rfc.best_score_
#0.8021820114800677

Results of Iteration 3

{'RMSE Test': 3.283690568225705, 'RMSE Train': 2.712331014201783}

My Function to evaluate RMSE

def model_evaluate(y_train, y_test, y_pred, y_pred_train):
    metrics = {}
    #RMSE Test
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    #RMSE Train
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

    metrics = {
              'RMSE Test': rmse_test,
              'RMSE Train': rmse_train}

    return metrics

So I had below questions after 3 iterations

Why are the results of tuned model worst than the model with default parameters even when I am using RandomSearchCV and GridSearchCV. Ideally the model should give good results when tuned with cross-validation
I know that cross-validation will take place only for the combination of values present in param_grid.There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation
How do I decide what range of values I should try for max_features, min_samples_split, max_depth or for that matter any hyper-parameters in a machine learning model to increase its accuracy.(So that I can at least get a better tuned model than the model with default hyper-parameters)

There are not "hard" scientific answers (or even broad guidelines) for your questions #2 & #3; this is part of the art, coming from experience (you have arguably gained some yourself, which might translate in practice "always try the default parameters, too"); and probably a possible answer to your question #1 may lie in your question #2... — desertnaut, Nov 29 '18 at 18:10

score 3 · Answer 1 · answered Nov 30 '18 at 04:59

Why are the results of tuned model worst than the model with default parameters even when I am using RandomSearchCV and GridSearchCV. Ideally the model should give good results when tuned with cross-validation

Your second question kind of answers your first one, but I tried to reproduce your results on the boston dataset, I got {'test_rmse':3.987, 'train_rmse':1.442} with default parameters, {'test_rmse':3.98, 'train_rmse':3.426} for 'tuned' parameters with random search and {'test_rmse':3.993, 'train_rmse':3.481} with grid search. Then I used hyperopt with following parameter space

 {'max_depth': hp.choice('max_depth', range(1, 100)),
    'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
    'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}

After about 200 runs results looked like this, so I widened the space to 'min_samples_split', 0.01, 1 which got me the best result of {'test_rmse':3.278, 'train_rmse':1.716} with min_samples_split equal to 0.01. According to documentation the formula for min_samples_split is ceil(min_samples_split * n_samples) which in our case gives np.ceil(0.1 * len(x_train))=34 which could be kind of big for a small dataset like this.

I know that cross-validation will take place only for the combination of values present in param_grid.There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation

How do I decide what range of values I should try for max_features, min_samples_split, max_depth or for that matter any hyper-parameters in a machine learning model to increase its accuracy.(So that I can atleast get a better tuned model than the model with default hyper-parameters)

You can't know this in advance, so you have to do research for each algorithm to see what kind of parameter spaces are usually searched (good source for this is kaggle, e.g. google kaggle kernel random forest), merge them, account for your dataset features and optimize over them using some kind of Bayesian Optimization algorithm (there are multiple existing libraries for this) which try to optimally select for a new parameter value to choose.

Neat demonstration of how `min_samples_split` can be a more useful (consistent) hyperparameter to limit depth than `max_depth`. I guess `min_samples_leaf` would look similar? — Jon Nordby, Dec 16 '18 at 22:56

Hyperparameter Tuning in Random forest

1 Answers1