How is train_test_split with test_size=0 affecting the data?

Question

I was using train_test_split in my code and then wanted to switch to cross-validation, but something strange is happening.

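import pandas as pd
from sklearn import model_selection, neighbors
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# data is a DataFrame that includes the target column 'CRO'
train, test = train_test_split(data, test_size=0)

x_train = train.drop('CRO', axis=1)
y_train = train['CRO']

# Scale every feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

# 5-fold cross-validation for k = 1..4
for k in range(1, 5):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = model_selection.cross_val_score(knn, x_train, y_train, cv=5)
    print(scores.mean(), 'score for k = ', k)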

This code gives scores around 0.8. But when I delete the train_test_split line and use 'data' instead of 'train' in the x_train and y_train lines, the score drops to 0.2. That is strange, because I set test_size to 0, so train should be equal to the whole data. What is happening?

One thing to be aware of is the implicit arguments passed to train_test_split. By default, `shuffle=True`, which could easily be adding some noise to your training data by shuffling it, whereas passing the data in without shuffling may be introducing some other pattern into the model — G. Anderson, Apr 30 '19 at 16:59
I hadn't thought about the fact that the data was sorted. I added shuffling and it works again. Thank you! — SlimakSlimak, Apr 30 '19 at 17:31
Since it resolved your issue, I moved my comment into an answer so you can accept it if you wish. Cheers! — G. Anderson, Apr 30 '19 at 17:33

score 1 · Accepted Answer · answered Apr 30 '19 at 17:32

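One thing to be aware of is the implicit arguments passed to train_test_split.

By default, shuffle=True, so even with test_size=0 the rows of train come back in a random order rather than in the order they have in data. That matters here because cross_val_score with cv=5 on a regressor splits the rows into consecutive blocks without shuffling. If data is sorted, as it turned out to be here, each unshuffled fold holds out a range of values the model never saw during training, which drags the score down. Shuffling first gives every fold a mix of rows from the whole dataset.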
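To see the effect of row order in isolation, here is a minimal sketch on synthetic data. The sorted feature, the linear target, and the noise level are assumptions made up for illustration and are not the asker's dataset. It compares cv=5 with a KFold splitter that shuffles; passing such a splitter is one way to get shuffled folds without going through train_test_split.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in data: rows sorted by the single feature,
# with a target that depends on that feature plus some noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

knn = KNeighborsRegressor(n_neighbors=3, weights='uniform')

# cv=5 on a regressor uses KFold(n_splits=5) without shuffling,
# so each test fold is a contiguous block of the sorted rows.
unshuffled_scores = cross_val_score(knn, X, y, cv=5)

# Shuffling inside the splitter gives each fold a random mix of rows.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(knn, X, y, cv=shuffled_cv)

print(unshuffled_scores.mean(), 'mean score, unshuffled folds')
print(shuffled_scores.mean(), 'mean score, shuffled folds')
```

On sorted data like this, the unshuffled folds should score noticeably worse than the shuffled ones, which matches the gap described in the question. Setting random_state keeps the shuffled split reproducible between runs.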
