
I have prepared a dataset to recognise a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 negatives and 30 positives.

Since the identical training instances actually represent different objects, can I say that this duplication carries relevant information (e.g. about the distribution of object feature values) that may be useful in some way?

Sultan Abraham

1 Answer


If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).

To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 copies of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different from one trained on the original data.
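To see the shift numerically, here is a minimal Python sketch of the two-object example above (the `labels` list and the `base_rates` helper are purely illustrative, not part of any particular library):

```python
# Illustration of how dropping duplicates distorts base rates.
from collections import Counter

labels = ["A"] * 99 + ["B"]  # 99 copies of object A, 1 of object B

def base_rates(ys):
    counts = Counter(ys)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(base_rates(labels))       # {'A': 0.99, 'B': 0.01}
print(base_rates(set(labels)))  # both classes at 0.5 once duplicates are removed
```

A classifier fit to the second distribution behaves as if A and B were equally common, which is not true of the original data.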

My advice is to leave the duplicates in the data.

Robert Dodier
  • Thank you for your answer. Could you please suggest any reference that provides a further explanation of this issue? – Sultan Abraham Oct 05 '14 at 10:33
  • The training data preparation suffers from several limitations, which means that the training data is not necessarily a representative sample of the real world. Also, keeping the duplicated training instances will affect the cross-validation estimate of accuracy, as identical instances may exist in the training subset as well as the test subset. – Sultan Abraham Oct 05 '14 at 10:41
  • Last question :) With this level of imbalance, either before de-duplicating the data or after, should I use an oversampling technique? – Sultan Abraham Oct 05 '14 at 10:43
  • Hmm, a reference might be the machine learning book by Brian Ripley. Sorry, I can't cite a section or page. If the base rates for different objects are different in the real world as compared to the training data, you can compensate for that. Oversampling might indeed be useful if the base rates are very different from one object to another. About duplicate data in training and validation sets, I don't know, at the moment, what to do about that. – Robert Dodier Oct 06 '14 at 06:12
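Following up on the oversampling question in the comments: below is a minimal sketch of random oversampling of the positive class, assuming scikit-learn and NumPy are available. `X` and `y` here are synthetic placeholders shaped roughly like the asker's dataset, not real data.

```python
# Random oversampling of the minority (positive) class with replacement.
# X and y are synthetic placeholders; substitute the real feature matrix and labels.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(2330, 10)               # ~2330 objects, 10 features each
y = np.array([0] * 2240 + [1] * 90)   # ~2240 negatives, ~90 positives

X_neg, X_pos = X[y == 0], X[y == 1]

# Duplicate positive instances at random until the classes are balanced.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_balanced = np.vstack([X_neg, X_pos_up])
y_balanced = np.array([0] * len(X_neg) + [1] * len(X_pos_up))
print(X_balanced.shape, y_balanced.mean())  # (4480, 10) with mean label 0.5
```

For the related concern about identical instances landing in both the training and test folds during cross-validation, one option is to assign all copies of the same feature vector a common group id and split with scikit-learn's `GroupKFold`, so that each duplicate cluster stays on one side of the split.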