
Here is the config of my model:

"model": {
        "loss": "categorical_crossentropy",
        "optimizer": "adam",
        "layers": [
            {
                "type": "lstm",
                "neurons": 180,
                "input_timesteps": 15,
                "input_dim": 103,
                "return_seq": true,
                "activation": "relu"
            },
            {
                "type": "dropout",
                "rate": 0.1
            },
            {
                "type": "lstm",
                "neurons": 100,
                "activation": "relu",
                "return_seq": false
            },
            {
                "type": "dropout",
                "rate": 0.1
            },
            {
                "type": "dense",
                "neurons": 30,
                "activation": "relu"
            },
            {
                "type": "dense",
                "neurons": 3,
                "activation": "softmax"
            }
        ]
    }
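
For reference, this config corresponds roughly to the following Keras model (a sketch reconstructed from the JSON above; my actual builder code may differ):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Rough reconstruction of the model described by the config above.
model = Sequential([
    LSTM(180, input_shape=(15, 103), activation='relu', return_sequences=True),
    Dropout(0.1),
    LSTM(100, activation='relu', return_sequences=False),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dense(3, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')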

Once I finished training the model, I decided to compare what the confusion matrix looks like depending on whether or not I shuffle the test dataset and its labels.

I shuffled with the following line:

from sklearn.utils import shuffle
X, label = shuffle(X, label, random_state=0)

Be aware that X and label are the test data and the test labels; the training set is not involved here.
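
The reports below can be reproduced along these lines (a sketch; the argmax decoding assumes one-hot encoded labels, which may differ from my actual code):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Predict class probabilities on the test set and take the most likely class.
y_pred = np.argmax(model.predict(X), axis=1)
y_true = np.argmax(label, axis=1)  # assuming one-hot encoded labels

print("Confusion Matrix")
print(confusion_matrix(y_true, y_pred))
print("Classification Report")
print(classification_report(y_true, y_pred,
                            target_names=['class -1', 'class 0', 'class 1']))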

Confusion matrix with a shuffling phase

Confusion Matrix
[[16062  1676  3594]
 [ 1760  4466  1482]
 [ 3120  1158 13456]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.77      0.75      0.76     21332
    class 0       0.61      0.58      0.60      7708
    class 1       0.73      0.76      0.74     17734

avg / total       0.73      0.73      0.73     46774

Confusion matrix without a shuffling phase

Confusion Matrix
[[12357  2936  6039]
 [ 1479  4301  1927]
 [ 3316  1924 12495]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.72      0.58      0.64     21332
    class 0       0.47      0.56      0.51      7707
    class 1       0.61      0.70      0.65     17735

avg / total       0.64      0.62      0.62     46774

As you can see, the precision in the two reports is significantly different. What can explain the gap between them?

– fgauth

2 Answers


Data shuffling never hurts performance, and it very often helps; the reason is that it breaks possible biases introduced during data preparation, e.g. putting all the cat images first and then all the dog ones in a cat/dog classification dataset.

Take for example the famous iris dataset:

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can clearly see, the dataset has been prepared in such a way that the first 50 samples all have label 0, the next 50 label 1, and the last 50 label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and every fold will contain exactly one label, as shown below. Bad... BTW, it's not just a theoretical possibility; it has actually happened.
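
To see this concretely, here is a quick check with scikit-learn's KFold (which does not shuffle by default):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Without shuffling, each 3-fold test set contains exactly one class.
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(X)):
    print(f"fold {fold}: test labels = {np.unique(y[test_idx])}")
# fold 0: test labels = [0]
# fold 1: test labels = [1]
# fold 2: test labels = [2]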

Since it's very difficult to know beforehand whether such a bias exists in our dataset, we always shuffle (as said, it never hurts) just to be on the safe side, which is why shuffling is a standard step in machine learning pipelines.
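
In scikit-learn terms, this usually just means passing shuffle=True where available (a sketch; random_state is only for reproducibility):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)

# A shuffled hold-out split and shuffled CV folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)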

So, even though the situation here obviously depends on the details of your data (which we don't know), this behavior is not at all surprising; on the contrary, it is totally expected.

– desertnaut

The support counts for class 0 and class 1 differ by one between your two classification reports.

You need to make sure that there is no mistake in matching the data to the class labels.

– Aiden Zhao
  • Be aware `X` and `label` are the test data and labels, so this is not related to the training sets. – fgauth Jan 25 '19 at 22:51
  • There is no error. You have the full confusion matrix, so you can rebuild the classification report yourself. It seems to be totally fine. – fgauth Jan 25 '19 at 23:06
  • Then why is the total off by 1? – Aiden Zhao Jan 25 '19 at 23:17
  • From the total support entries of both classification reports it is clear that the total samples are 46774 in both cases. In addition, `sum([sum(x) for x in cm])` gives 46774 for both confusion matrices `cm` shown. OP is right, there is no error - kindly delete this answer. – desertnaut Apr 15 '20 at 11:37