
I am working on an unbalanced dataset, and I noticed that, strangely, if I shuffle the data during cross-validation I get a high F1 score, while if I do not shuffle it the F1 is low. Here is the function I use for cross-validation:

def train_cross_v(md, df_train, n_folds=5, shuffl=False):
    # `variable` (defined elsewhere) is the name of the target column
    X, y = df_train.drop([variable], axis=1), df_train[variable]

    cv = StratifiedKFold(n_splits=n_folds, shuffle=shuffl)

    scores = cross_val_score(md, X, y, scoring='f1', cv=cv, n_jobs=-1)

    y_pred = cross_val_predict(md, X, y, cv=cv, n_jobs=-1)
    print(' f1: ', scores, np.mean(scores))
    # printed as (y_pred, y): rows are predicted labels, columns are true labels
    print(confusion_matrix(y_pred, y))
    return np.mean(scores)
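One detail about this function: because shuffl=True is passed without a random_state, cross_val_score and cross_val_predict each trigger a fresh shuffle, so the printed scores and the confusion matrix come from different partitions of the data. A minimal, self-contained sketch of this behavior (the toy arrays here are made up purely for illustration):

import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.tile([0, 1], 10)
X_toy = np.zeros((20, 1))

cv = StratifiedKFold(n_splits=2, shuffle=True)  # no random_state
print([test.tolist() for _, test in cv.split(X_toy, y_toy)])
print([test.tolist() for _, test in cv.split(X_toy, y_toy)])  # almost surely a different partition

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)  # seeded shuffle
print([test.tolist() for _, test in cv.split(X_toy, y_toy)])
print([test.tolist() for _, test in cv.split(X_toy, y_toy)])  # identical partition on every call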

With shuffling I get an F1 of around 0.82:

nfolds=5
train_cross_v(XGBClassifier(),df_train,n_folds=nfolds,shuffl=True)
f1:  [0.81469793 0.82076749 0.82726257 0.82379249 0.82484862] 0.8222738195197493
[[23677  2452]
[ 1520  9126]]
0.8222738195197493
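As a sanity check, the F1 above can be approximately recovered from this confusion matrix. Since the matrix is printed as confusion_matrix(y_pred, y), rows index predicted labels and columns index true labels:

# Recover F1 for the positive class from the shuffled confusion matrix above:
# tp = predicted 1 & true 1, fp = predicted 1 & true 0, fn = predicted 0 & true 1
tp, fp, fn = 9126, 1520, 2452
precision = tp / (tp + fp)  # ~0.857
recall = tp / (tp + fn)     # ~0.788
print(2 * precision * recall / (precision + recall))  # ~0.821

The small gap to the 0.822 fold average is expected: cross_val_predict pools predictions across folds rather than averaging per-fold scores (and, as noted above, its splits differ from those of cross_val_score when the shuffle is not seeded).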

While not shuffling leads to:

nfolds=5
train_cross_v(XGBClassifier(),df_train,n_folds=nfolds,shuffl=False) 

f1:  [0.67447073 0.55084022 0.4166443  0.52759421 0.64819164] 0.5635482198057791
[[21621  5624]
[ 3576  5954]]
0.5635482198057791

As I understand it, shuffling is preferred for assessing the real performance of the model, as it allows us to neglect any dependencies related to the ordering of the data, and usually the post-shuffling value of the performance metric is lower than the value without shuffling. In my case, however, the behavior is the exact opposite: I get a high value if I shuffle, while the predictions on the test set remain unchanged. What could be the problem here?

  • "*As I understand it [...] usually the post shuffling value of the performance metric is lower than that without shuffling.*" - no, your understanding is wrong. Shuffling is needed in order to break any artificially imposed ordering in the (non-timeseries) data which may harm the learning process, and it always leads to more *reliable* performance estimates. – desertnaut Sep 09 '22 at 23:44
  • Yes, I see, but in my case shuffling gives me an F1 score of around 0.82, which is completely unreliable when it comes to the test dataset; the real F1 is closer to the one I get when I don't shuffle... – Sunny Sep 10 '22 at 14:45
  • So basically I have the same actual performance on the test data but two very different performances in cross-validation... – Sunny Sep 10 '22 at 14:53

1 Answer


Because the order of your data is important. Let's consider the following example:

  1. Suppose we have completely balanced, alternating labels:

[0, 1, 0, 1, 0, 1, 0, 1, 0, ...]

  2. And a single-column feature matrix that matches the labels, i.e.:

[[0],
 [1],
 [0],
 [1],
 ...]

  3. Suppose the first 25% of the data are noisy, i.e. their features no longer match their labels:

n_noisy = int(n_examples * 0.25)
X[:n_noisy] = 1 - X[:n_noisy]

So we have: [25% noisy, 25% normal, 25% normal, 25% normal]

  4. Now we use 2-fold cross-validation (2 folds for simplicity).

4.1 Without shuffling we get the following metrics:

 f1:  [0.5 0. ] 0.25  # the metric for the second fold is zero

The first fold is trained on the second half of the data ([25% normal, 25% normal]), which contains no noise, and tested on the first half ([25% noisy, 25% normal]), which is 50% noise; this yields f1 = 0.5.

The second fold is trained on the first half of the data, half of which is inverted, and as a result f1 = 0.
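To see the fold composition concretely, here is a small sketch (shrunk to 16 samples for readability; not part of the original example) printing which rows land in each test fold when shuffle=False:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y_small = np.tile([0, 1], 8)  # 16 alternating labels, as in the example
X_small = np.zeros((16, 1))   # feature values do not influence the split itself

cv = StratifiedKFold(n_splits=2, shuffle=False)
for i, (train_idx, test_idx) in enumerate(cv.split(X_small, y_small)):
    # Without shuffling, each class is cut into contiguous blocks, so fold 0
    # tests on the first half of the rows and fold 1 on the second half.
    print(f"fold {i}: test={test_idx.tolist()}")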

4.2 With shuffling:

 f1:  [0.74903475 0.75103734] 0.7500360467165447

As expected, we get f1 ≈ 0.75, because 25% of the data is noise.

Source code:

from xgboost import XGBClassifier
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix


def train_cross_v(md, X, y, n_folds=5, shuffl=False):

    cv = StratifiedKFold(n_splits=n_folds, shuffle=shuffl)

    scores = cross_val_score(md, X, y, scoring="f1", cv=cv, n_jobs=-1)

    y_pred = cross_val_predict(md, X, y, cv=cv, n_jobs=-1)
    print(" f1: ", scores, np.mean(scores))
    # printed as (y_pred, y): rows are predicted labels, columns are true labels
    print(confusion_matrix(y_pred, y))
    return np.mean(scores)


nfolds = 2
n_examples = 1000

# Perfectly balanced, alternating labels; the single feature equals the label
y = np.tile([0, 1], 500)
X = y.copy().reshape(-1, 1)

# Invert the feature on the first 25% of rows so they no longer match their labels
n_noisy = int(n_examples * 0.25)
X[:n_noisy] = 1 - X[:n_noisy]


train_cross_v(XGBClassifier(), X, y, n_folds=nfolds, shuffl=False)
train_cross_v(XGBClassifier(), X, y, n_folds=nfolds, shuffl=True)

So the order matters, and shuffling can either increase or decrease performance.
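If shuffling is used, it is also worth pinning random_state so that the estimate is reproducible and cross_val_score and cross_val_predict see identical folds. A minimal sketch, continuing from the script above (random_state=42 is an arbitrary choice):

# Same experiment, but with a seeded shuffle: every run, and both helpers,
# now operate on exactly the same folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(XGBClassifier(), X, y, scoring="f1", cv=cv, n_jobs=-1)
print(scores, np.mean(scores))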

  • This *could* be a nice answer, if you were not puzzlingly stating that "*the order of your data is important*" and that "*order matters*" (here we assume non-timeseries data, since in timeseries the order *does* matter indeed, and that's why we cannot shuffle such data). Shuffling works (and is always recommended) exactly because the ordering does *not* matter, but artificial ordering imposed by the data preparation process may harm the learning process - and we shuffle exactly in order to break such a (possible) ordering. – desertnaut Sep 09 '22 at 23:43
  • Yes, exactly what I was thinking too... if the order of the data were important, shuffling should reduce the F1 instead of increasing it... but overall I think the explanation makes sense... – Sunny Sep 10 '22 at 09:30
  • Maybe I'm missing something, but there are a lot of cases where the order does matter even with non-timeseries data. Please consider linear models fitted with iterative solvers (like stochastic gradient descent, where the weights are updated using a single sample at a time), online machine learning, and curriculum learning (which deliberately feeds the data in a non-random order that leads to better results); in all of these, the order in which data is fed into your model affects the final performance. Artificial ordering may harm the learning process or may improve it. – u1234x1234 Sep 10 '22 at 10:09
  • Yes, I agree that the order of the data affects performance; however, my actual concern was the big positive difference between shuffling and not shuffling. When it comes to reliability, my F1 score is not reliable at all, since the test-data F1 is closer to the non-shuffled one... and this seems to be in contrast with the general rule of thumb... – Sunny Sep 10 '22 at 15:01