0

Am trying to classify 10000 samples of text into 20 classes. 4 of the classes have just 1 sample each, I tried SMOTE to address this imbalance, but I am unable to generate new samples for classes that have only one record, though I could generate samples for classes with more than 1 sample. Any suggestions?

1 Answers1

0

A good explainer (and a potential answer to your question on why it might not have worked on the undersampled classes) on SMOTE can be found in this answer.

I think this issue can't be solved easily through off-the-shelf data augmentation strategies. One possibility might be to simply duplicate the example, but this would add no new information to your model.

Here are a couple other strategies you could try as well:

  1. An embedding-based augmentation technique (similar theory to SMOTE but works better on text data) that's described in this 2015 paper by William Wang and Diyi Yang.
  2. A step further on #1 using contextualized word embeddings described here in this 2017 paper by Marzieh Fadaee, Arianna Bisazza, and Christof Monz.
  3. Use a synonym replacement library like WordNetAug.
nlpnoah
  • 26
  • 4