Unexpected cross-validation scores with scikit-learn LinearRegression

Question

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:

from sklearn import cross_validation, linear_model

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y,
    test_size=0.2, random_state=0)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Which yields:

0.797144744766

Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:

model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print(scores)

And I get output like this:

[ 0.04614495 -0.26160081 -3.11299397 -0.7326256  -1.04164369]

How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.

I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.

I am using sklearn version 0.16.1

score 4 · Answer 1 · answered Nov 10 '15 at 23:12

It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits, so each test fold came from a different part of the ordering than most of its training data. I was able to solve this by specifying that the cross-validation should use shuffled splits:

model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print(scores)

Which gives:

[ 0.79714474  0.86636341  0.79665689  0.8036737   0.6874571 ]

This is in line with what I would expect.

score 3 · Accepted Answer · answered Nov 10 '15 at 23:39

train_test_split generates random (shuffled) splits of the dataset, while cross_val_score with an integer cv uses consecutive folds, i.e.

"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"

http://scikit-learn.org/stable/modules/cross_validation.html

Depending on the nature of your data set (for example, data that is highly correlated within one segment), models fitted on consecutive folds can differ vastly from models fitted on random samples of the whole data set.
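The effect can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the data (a curved target sorted by x) is invented, not the asker's, and it uses the sklearn.model_selection module from scikit-learn 0.18 or later rather than the 0.16.1 cross_validation module.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data, sorted by x. The target is curved, so a straight line
# fitted on one range of x extrapolates poorly to a different range.
x = np.linspace(0.0, 10.0, 200)
X = x.reshape(-1, 1)
y = x ** 2

model = LinearRegression()

# For a regressor, cv=5 means KFold(n_splits=5) without shuffling:
# each test fold is one contiguous slice of the sorted rows.
consecutive_scores = cross_val_score(model, X, y, cv=5)

# Same model and data, but rows are shuffled before being cut into folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)

print("consecutive:", consecutive_scores)
print("shuffled:   ", shuffled_scores)
```

On this data the consecutive folds give a strongly negative average R2, because each fold is scored on a range of x the line was not fitted to, while the shuffled folds score close to what a single random split would give.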

score 0 · Answer 3 · answered Mar 18 '18 at 05:44

Folks, thanks for this thread.

The code in the answer above (Schneider) is outdated.

As of scikit-learn==0.19.1, this will work as expected.

from sklearn.model_selection import cross_val_score, KFold

# regressor is any estimator, e.g. LinearRegression(); X and y as in the question
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)

Best,

M.
