I am fitting a regression model with a set of continuous features and a continuous target. This is my code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

def run_RandomForest(xTrain, yTrain, xTest, yTest):
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the pipeline to evaluate
    model = RandomForestRegressor()
    fs = SelectKBest(score_func=mutual_info_regression)
    pipeline = Pipeline(steps=[('sel', fs), ('rf', model)])
    # define the grid ('sel__k' is included so feature selection is tuned too;
    # note this grid is very large, so the search is expensive)
    grid = {
        'sel__k': list(range(1, xTrain.shape[1] + 1)),
        'rf__bootstrap': [True, False],
        'rf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
        # 'auto' was removed in recent scikit-learn; for regressors it meant all features (1.0)
        'rf__max_features': [1.0, 'sqrt'],
        'rf__min_samples_leaf': [1, 2, 4],
        'rf__min_samples_split': [2, 5, 10],
        'rf__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    }
    search = GridSearchCV(
        pipeline,
        param_grid=grid,
        scoring='neg_mean_squared_error',
        return_train_score=True,
        verbose=1,
        cv=cv,  # use the RepeatedKFold splitter defined above, not a plain 5-fold
        n_jobs=-1)
    # perform the fitting
    search.fit(xTrain, yTrain)
    # predict on the held-out test set
    y_pred = search.predict(xTest)
    return search, y_pred

search, y_pred = run_RandomForest(x_train, y_train, x_test, y_test)
I want to understand whether this model is over-fitting. I have read that incorporating cross-validation is an effective way to check for this.
You can see I've incorporated cv into the code above. However, I'm totally stuck on the next step. Can someone show me the code that takes the cv information and produces either a plot or a set of statistics that I should analyse for over-fitting? I know there are similar questions on SO (e.g. here and here), but I don't understand how to translate either of them to my situation, because in both of those examples they just initialise a model and fit it, whereas mine goes through GridSearchCV.
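For reference, here is my rough sketch of the kind of check I think I'm after, pieced together from the GridSearchCV docs. Since I set return_train_score=True, cv_results_ should contain mean_train_score alongside mean_test_score for every candidate, and my understanding is that a validation MSE much worse than the training MSE would indicate over-fitting. This assumes run_RandomForest returns the fitted search object, as in my code above; I'd appreciate confirmation that this is the right thing to analyse:

import matplotlib.pyplot as plt
import numpy as np

# 'search' is the fitted GridSearchCV returned by the call above
results = search.cv_results_
best = search.best_index_

# scoring was 'neg_mean_squared_error', so negate the scores to get MSE
train_mse = -results['mean_train_score'][best]
val_mse = -results['mean_test_score'][best]
print('best params:', search.best_params_)
print('cv train MSE:      %.4f' % train_mse)
print('cv validation MSE: %.4f' % val_mse)
print('gap (val - train): %.4f' % (val_mse - train_mse))

# plot train vs validation MSE for all candidates, ranked best to worst
# by validation score, to see whether the gap is consistent
order = np.argsort(-results['mean_test_score'])
plt.plot(-results['mean_train_score'][order], label='train MSE')
plt.plot(-results['mean_test_score'][order], label='validation MSE')
plt.xlabel('candidate (ranked by validation score)')
plt.ylabel('MSE')
plt.legend()
plt.show()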