I’m reading an engineering paper. The authors have a labelled dataset that is imbalanced: there are many more instances labelled A than B. They want to train a classifier that predicts the A or B label from some inputs (states).
The authors say:
To artificially remedy this problem, random replicas of the B states are incorporated into the dataset to even out the lot.
I don’t know much about data analytics, but this doesn’t sound very valid to me. Is it?
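
For concreteness, here is a minimal sketch of what I understand the authors to be doing, namely random oversampling of the minority class B by duplicating its rows until the classes are even. The feature names (`state_1`, `state_2`) and the toy data are made up by me, not taken from the paper:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 instances labelled A, 10 labelled B.
df = pd.DataFrame({
    "state_1": rng.normal(size=100),
    "state_2": rng.normal(size=100),
    "label":   ["A"] * 90 + ["B"] * 10,
})

majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Draw B rows with replacement until there are as many as A rows,
# then append these replicas to the original data.
replicas = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, replicas], ignore_index=True)

print(balanced["label"].value_counts())  # A: 90, B: 90
```

Is balancing the classes by duplicating minority-class rows like this a legitimate thing to do before training?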