After evaluating my regression model with cross_validate using the 'r2' scoring, I obtain the following results.
This is what my code does:
scores = cross_validate(RandomForestRegressor(),X,y,cv=5,scoring='r2')
and what I get is
>>scores['test_score']
array([0.47146303, 0.47492019, 0.49350646, 0.56479323, 0.56897343])
For more flexibility, I've also written my own cross-validation function:
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def my_cross_val(estimator, X, y):
    r2_scores = []
    kf = KFold(shuffle=True)
    for train_index, test_index in kf.split(X, y):
        # Fit on the training fold, score R^2 on the held-out fold
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2 = r2_score(y.iloc[test_index].values, preds)
        r2_scores.append(r2)
    return np.array(r2_scores)
Running now
scores = my_cross_val(RandomForestRegressor(),X,y)
I obtain
array([0.6975932 , 0.68211856, 0.62892119, 0.64776752, 0.66046326])
Am I doing something wrong in my_cross_val(), given that its values seem overestimated compared to those from cross_validate()? Could passing shuffle=True to KFold be the cause?
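To isolate the effect of shuffling, one possible check is to pass the same shuffled KFold splitter explicitly to cross_validate, so that both approaches evaluate on identical folds. This is a minimal sketch on synthetic data; X_demo and y_demo are hypothetical stand-ins for the real X and y, and the random seeds are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data standing in for the real X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Same shuffled splitter that my_cross_val uses, but with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(RandomForestRegressor(random_state=0),
                        X_demo, y_demo, cv=cv, scoring='r2')
print(scores['test_score'])  # five R^2 values, one per shuffled fold
```

If the scores from this call line up with those from my_cross_val (run with the same splitter) rather than with the original cross_validate call, the difference would come from the fold assignment, not from the scoring itself.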