
I want to develop a CNN model to identify 24 hand signs in American Sign Language. I created a custom dataset that contains 3000 images for each hand sign, i.e. 72000 images in the entire dataset.

For training the model, I will be using an 80-20 dataset split (2400 images per hand sign in the training set and 600 images per hand sign in the validation set).
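Concretely, the choice I'm asking about is whether to include the shuffling step in something like the following (a minimal sketch, not my actual code; `images_by_sign` is a hypothetical dict mapping each hand sign to the list of its image paths):

import random

random.seed(42)
train_files, val_files = [], []
for sign, paths in images_by_sign.items():  # images_by_sign: hypothetical {sign: [paths]}
    random.shuffle(paths)                   # <-- the step in question
    train_files += [(p, sign) for p in paths[:2400]]  # 80% of each hand sign
    val_files   += [(p, sign) for p in paths[2400:]]  # remaining 20% (600 images)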

My question is: Should I randomly shuffle the images when creating the dataset? And why?

Based on my previous experience, shuffling led to the validation loss being lower than the training loss and the validation accuracy being higher than the training accuracy. Check this link.

– mayuresh_sa

2 Answers


Random shuffling of the data is a standard procedure in all machine learning pipelines, and image classification is no exception; its purpose is to break possible biases introduced during data preparation - e.g. putting all the cat images first and then all the dog ones in a cat/dog classification dataset.

Take for example the famous iris dataset:

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can clearly see, the dataset has been prepared in such a way that the first 50 samples are all of label 0, the next 50 of label 1, and the last 50 of label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and all of your folds will include only one label. Bad... BTW, it's not just a theoretical possibility - it has actually happened.
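Here is a quick sketch of the 3-fold case on the unshuffled iris data (scikit-learn's KFold does not shuffle by default):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import numpy as np

X, y = load_iris(return_X_y=True)

# with shuffle=False (the default), KFold just slices the data in its stored order
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(X)):
    print(f"fold {fold}: test labels = {np.unique(y[test_idx])}")
# fold 0: test labels = [0]
# fold 1: test labels = [1]
# fold 2: test labels = [2]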

Even if no such bias exists, shuffling never hurts, so we always do it, just to be on the safe side (you never know...).
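In code, this safety net is essentially a one-liner; for example, with scikit-learn (continuing with the iris data, fixed seed for reproducibility):

from sklearn.datasets import load_iris
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=42)  # shuffle samples and labels together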

Based on my previous experience, shuffling led to the validation loss being lower than the training loss and the validation accuracy being higher than the training accuracy. Check this link.

As noted in the answer there, it is highly unlikely that this was due to shuffling. Data shuffling is nothing sophisticated - essentially, it is just the equivalent of shuffling a deck of cards; it may have happened that you once insisted on "better" shuffling and subsequently ended up with a straight flush, but obviously that was not because of the "better" shuffling of the cards.

– desertnaut
  • thanks, so how do I determine if the model is well trained in that case (val acc is higher than training acc)? – mayuresh_sa Apr 15 '20 at 10:34
  • @mayuresh_sa that's another question altogether, which you have already posted elsewhere and got an answer. Very briefly, there is not *necessarily* anything wrong in such a situation (already pointed out in the response). And your learning curves seem OK. – desertnaut Apr 15 '20 at 10:40
  • `Even if no such bias exists, shuffling never hurts` - Actually it is not generally true - please consider the example of Curriculum Learning when the shuffling could break the curriculum and lead to worse results - https://ronan.collobert.com/pub/2009_curriculum_icml.pdf – u1234x1234 Sep 10 '22 at 09:57

Here is my two cents on the topic.

First of all, make sure to extract a test set that has an equal number of samples for each hand sign (hand sign #1 - 500 samples, hand sign #2 - 500 samples, and so on). I think this is referred to as stratified sampling.

When it comes to the training set, there is no huge mistake in shuffling the entire set. However, when splitting the training set into training and validation sets, make sure that the validation set is a good enough representation of the test set.
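A rough sketch of this two-stage split with scikit-learn (here X and y stand for your samples and labels, and the sizes and seed are placeholders - adjust to your own numbers):

from sklearn.model_selection import train_test_split

# 1) carve out a stratified test set (equal share of every hand sign)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# 2) split the rest into training and validation sets, again stratified
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)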

One of my personal experiences with shuffling: After splitting the training set into training and validation sets, the validation set turned out to be very easy to predict. Therefore, I saw good learning metric values. However, the performance of the model on the test set was horrible.

– Bilguun
  • What do you mean "there is no huge *mistake* in shuffling"? Is there a small one? And what has the last para to do with shuffling, in particular, which is what the question is about (and not splitting in general)? You seem to attribute a horrible test performance (despite a good validation one) to shuffling... – desertnaut Apr 15 '20 at 09:52
  • In my dataset, I tried to have a harder dataset for validation set that would represent the test set, but then the model does not converge even with a complex model like Conv2D(64, 128, 256, 512 - 3x3)>Dense(512)>Dense(24). Each Conv2D is followed by Maxpool2D(2x2) – mayuresh_sa Apr 15 '20 at 10:40
  • @desertnaut By "there is no huge mistake in shuffling", I meant you do not have to worry about it. Excuse me for not being clear, my main idea was that there are cases when the validation set is too easy to predict. It might be caused by wrong train/validation split method. Ultimately, I wanted to inform mayuresh_sa about this. – Bilguun Apr 16 '20 at 02:29
  • @mayuresh_sa I see. Although, I am not an expert by any means, I think you should not explicitly try to have a harder validation set. Balanced train/validation sets work for me usually. In addition, I guess the difficult samples are easy to distinguish than the rest? Can you provide us difficult/moderate to predict sample ratios for train/validation/test? – Bilguun Apr 16 '20 at 02:34
  • @Bilguun I did not clearly understand the last question. I am using an 80-20 train-validation split. By harder dataset - I have hand signs with plain backgrounds in the training set (white and red) and hand signs with different background patterns in the validation set. You can check the dataset on https://www.kaggle.com/mayureshamberkar/sign-language-dataset-24-signs-72000-images. And if I shuffled all the images, val accuracy is higher than training accuracy from the beginning of training; this was answered in another question (link already added in this question) – mayuresh_sa Apr 16 '20 at 09:27
  • Please **edit & update** your answer to clarify exactly what you mean; "*there is no huge mistake*" hardly renders as "*don't worry*", but rather as "*it's a mistake, but you can live with it*", which is clearly wrong. Kindly be reminded that SO is not a discussion forum, and answers are expected to be accurate and to the point of what is being asked. Storytelling stuff must be kept to an absolute minimum and always related to the question, which here is clearly about shuffling, *not* splitting. – desertnaut Apr 16 '20 at 12:19