The definition of an unbalanced sample

Question

As we know, an unbalanced sample causes problems and takes extra effort to handle.

While dealing with this issue, I got confused about the definition. Say I have a training dataset of 200 cats, 200 dogs and 400 stones. To classify 3 classes, I should use 200 cats, 200 dogs and 200 stones. But what should I use when I only want to classify 2 classes, pets and stones?

Should I still go with 400 pets (200 cats and 200 dogs) and 400 stones, so that the pet and stone classes have the same number of samples?

Or should I go with 400 pets (200 cats and 200 dogs) and 200 stones, so that every inner class is equally likely to be seen? After all, cats and dogs are essentially different.

score 1 · Answer 1 · answered Dec 05 '20 at 07:49

I think it is task dependent. If you are going to classify your samples into two classes (pets and stones), then you should use all 400 pet images (cats and dogs) and all 400 stone samples. However, if you have three classes (cats, dogs, and stones), then you need to limit the number of stone samples to 200 for every training epoch.

Why? In the two-class case (pets vs. stones), each label (pet and stone) updates the weights of the model 400 times per epoch. So after training finishes, the model will be able to recognize both classes equally well.

In the three-class case (cats, dogs, and stones), the cat and dog classes each update the weights 200 times per epoch while the stone class updates the weights 400 times per epoch, so the model will have a higher chance of outputting the stone class than the cat or dog class.
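To make the counting above concrete, here is a minimal sketch in plain Python. The label lists, `to_coarse` and `imbalance_ratio` are names I made up for illustration; the numbers are the ones from the question. It only counts samples per class under each labelling and does not train anything.

```python
from collections import Counter

# Hypothetical dataset from the question: 200 cats, 200 dogs, 400 stones.
fine_labels = ["cat"] * 200 + ["dog"] * 200 + ["stone"] * 400


def to_coarse(label):
    """Map the three fine labels onto the two-class task (pet vs. stone)."""
    return "pet" if label in ("cat", "dog") else "stone"


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


# Samples per class (i.e. per-class contributions to one epoch) for each task.
three_class_counts = Counter(fine_labels)
two_class_counts = Counter(to_coarse(label) for label in fine_labels)

print("3 classes:", dict(three_class_counts), imbalance_ratio(three_class_counts))
print("2 classes:", dict(two_class_counts), imbalance_ratio(two_class_counts))
```

With the same 800 images, the two-class labelling is balanced (ratio 1.0) while the three-class labelling is not (ratio 2.0), which is why balance has to be judged against the classes the model is actually asked to predict.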

So, in summary, you should make the number of samples the same for all classes.

PS: If you randomly select 200 of the 400 stone samples for each epoch in the three-class case, your model won't end up biased toward the stone class compared to the other two classes. However, it will generalize better on the stone class than on the other two, because over the course of training it has seen more unique samples of that class.
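A rough sketch of that per-epoch random undersampling, again in plain Python. The function `undersample_epoch`, the file-name-like item strings and the fixed seed are assumptions for illustration, not part of any library; a real training loop would feed `epoch_data` to the model instead of just collecting statistics.

```python
import random
from collections import Counter


def undersample_epoch(samples, rng):
    """Return a class-balanced subset: each class is cut to the smallest class size."""
    by_class = {}
    for item, label in samples:
        by_class.setdefault(label, []).append((item, label))
    n = min(len(group) for group in by_class.values())
    subset = []
    for group in by_class.values():
        # Sample without replacement; classes already of size n are kept whole.
        subset.extend(rng.sample(group, n))
    rng.shuffle(subset)
    return subset


# Hypothetical dataset from the question: 200 cats, 200 dogs, 400 stones.
samples = (
    [(f"cat_{i}", "cat") for i in range(200)]
    + [(f"dog_{i}", "dog") for i in range(200)]
    + [(f"stone_{i}", "stone") for i in range(400)]
)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
seen_stones = set()
for epoch in range(3):
    epoch_data = undersample_epoch(samples, rng)
    epoch_counts = Counter(label for _, label in epoch_data)
    seen_stones.update(item for item, label in epoch_data if label == "stone")
    print(epoch, dict(epoch_counts), "unique stones so far:", len(seen_stones))
```

Every epoch is balanced at 200 samples per class, but because a fresh subset of stones is drawn each time, the number of distinct stone images seen over training can grow beyond 200, while the model only ever sees the same 200 cats and 200 dogs.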

Thanks, @SELLAM. I am thinking about running an experiment based on your suggestion, to check whether the predicted probability is higher for a class when more samples of that class are provided. For now, what you said is quite convincing. I guess that's the reason why we need to balance the sample. - Grec001, Dec 05 '20 at 08:06
