
After validating the performance of my regression model with cross_validate, I obtain some results with the 'r2' scoring.

This is what my code does:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

scores = cross_validate(RandomForestRegressor(), X, y, cv=5, scoring='r2')

and what I get is

>>> scores['test_score']

array([0.47146303, 0.47492019, 0.49350646, 0.56479323, 0.56897343])

For more flexibility, I've also written my own cross-validation function, which is the following:

import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def my_cross_val(estimator, X, y):
    r2_scores = []
    kf = KFold(shuffle=True)  # 5 folds by default, rows shuffled before splitting

    for train_index, test_index in kf.split(X, y):
        # fit on the training fold, then score on the held-out fold
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2_scores.append(r2_score(y.iloc[test_index].values, preds))

    return np.array(r2_scores)

Running now

scores = my_cross_val(RandomForestRegressor(), X, y)

I obtain

array([0.6975932 , 0.68211856, 0.62892119, 0.64776752, 0.66046326])

Am I doing something wrong in my_cross_val(), as its values seem overestimated compared to those from cross_validate()? Maybe it's the shuffle=True inside KFold?

  • Shuffling can make a *huge* difference, but we cannot provide an answer without a [mre]; try manually shuffling your data before applying `cross_validate`. – desertnaut Jan 28 '22 at 12:38
  • That's it... I shuffled the data before passing it to cross_validate or cross_val_score, and the same results as my function are reached... The main point I wanted to be 100% sure about is that shuffling only happens **before** splitting, as we don't want our model to predict points on which it has been trained; that would definitely lead to overestimating the score – James Arten Jan 28 '22 at 12:43
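To see concretely why shuffling can matter so much, here is a minimal sketch with synthetic stand-in data (the original X and y are not shown), deliberately sorted by the target so that contiguous unshuffled folds must predict target ranges never seen during training:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, sorted by the target to mimic non-random row order
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
order = np.argsort(y)
X, y = pd.DataFrame(X[order]), pd.Series(y[order])

# Contiguous (unshuffled) folds each hold a target range unseen in training
unshuffled = cross_validate(RandomForestRegressor(random_state=0), X, y,
                            cv=KFold(n_splits=5, shuffle=False), scoring='r2')

# Shuffled folds mix the target ranges across train and test
shuffled = cross_validate(RandomForestRegressor(random_state=0), X, y,
                          cv=KFold(n_splits=5, shuffle=True, random_state=0),
                          scoring='r2')

print(unshuffled['test_score'].mean())  # much lower; extreme folds may go negative
print(shuffled['test_score'].mean())    # substantially higher

In a setup like this, it is the unshuffled scores that are pessimistic, rather than the shuffled ones being overestimated.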

1 Answer


In order to be sure that you are comparing apples to apples, and given that shuffling can make a huge difference in such cases, here is what you should do:

First, shuffle your data manually:

from sklearn.utils import shuffle
X_s, y_s = shuffle(X, y, random_state=42)

Then, run cross_validate with these shuffled data:

scores = cross_validate(RandomForestRegressor(), X_s, y_s, cv=5, scoring='r2')

Change your function to use

kf = KFold(shuffle=False) # no more shuffling (although it should not hurt)

and run it with the already shuffled data:

scores = my_cross_val(RandomForestRegressor(), X_s, y_s)

Now the results should be similar, but not yet identical. You can make them identical by defining a single kf = KFold(shuffle=False) beforehand (and outside of the function); with shuffle=False the partition is deterministic, so no random_state is needed, and newer scikit-learn versions actually raise an error if you set one without shuffling. Use that same kf inside the function, and run cross_validate as

scores = cross_validate(RandomForestRegressor(), X_s, y_s, cv=kf, scoring='r2') # cv=kf

i.e. using the exact same CV partition in both cases. Since RandomForestRegressor is itself randomized, you should also give the estimator the same random_state in both runs for truly identical scores.
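Putting the whole recipe together, here is a minimal end-to-end sketch; the make_regression data is a stand-in assumption, since the original X and y are not shown:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Stand-in data, assumed to be a DataFrame/Series like the original X, y
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)

# Step 1: shuffle once, up front
X_s, y_s = shuffle(X, y, random_state=42)

# Step 2: one deterministic partition shared by both approaches
kf = KFold(n_splits=5, shuffle=False)

# Step 3: cross_validate with the shared partition and a seeded estimator
scores_cv = cross_validate(RandomForestRegressor(random_state=0),
                           X_s, y_s, cv=kf, scoring='r2')['test_score']

# Step 4: the hand-rolled loop with the same partition and the same seed
def my_cross_val(estimator, X, y, kf):
    r2_scores = []
    for train_index, test_index in kf.split(X, y):
        estimator.fit(X.iloc[train_index].values, y.iloc[train_index].values)
        preds = estimator.predict(X.iloc[test_index].values)
        r2_scores.append(r2_score(y.iloc[test_index].values, preds))
    return np.array(r2_scores)

scores_manual = my_cross_val(RandomForestRegressor(random_state=0), X_s, y_s, kf)

print(np.allclose(scores_cv, scores_manual))  # True: same folds, same seeds

With the shared KFold partition and the same estimator random_state, the two score arrays come out element-for-element equal.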

  • Many thanks. Is it common behaviour to set shuffle=True or False in KFold? – James Arten Jan 28 '22 at 13:02
  • @JamesArten The default is `False`, but it depends on your setting. In general, extra shuffling never hurts ;) – desertnaut Jan 28 '22 at 13:03
  • It doesn't hurt at all; it increases my r2_score by 0.20 xD I just wanted to be sure that I'm not doing anything **wrong** by setting it to True in my KFold. – James Arten Jan 28 '22 at 13:10