I am trying to classify 10,000 text samples into 20 classes. Four of the classes have just one sample each. I tried SMOTE to address this imbalance, but it cannot generate new samples for the classes that have only one record, although it works for classes with more than one sample. Any suggestions?
A good explainer on SMOTE (and a potential answer to why it fails on your single-sample classes) can be found in this answer.
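The short version of why it fails: SMOTE creates a synthetic point by interpolating between an existing minority-class sample and one of its same-class nearest neighbours. A class with a single sample has no neighbour to interpolate toward. A minimal sketch of that core step (plain Python, not the imbalanced-learn implementation):

```python
import random

def smote_like_sample(class_points, rng=None):
    """Core SMOTE step: interpolate between a point and a same-class
    neighbour. With only one point there is no neighbour, so no new
    sample can be generated -- hence the failure on 1-sample classes."""
    if len(class_points) < 2:
        raise ValueError("SMOTE needs at least 2 samples in the class")
    rng = rng or random.Random()
    a, b = rng.sample(class_points, 2)
    lam = rng.random()  # random position along the segment a -> b
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]
```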
I think this issue can't be solved easily through off-the-shelf data augmentation strategies. One possibility might be to simply duplicate the example, but this would add no new information to your model.
Here are a few other strategies you could try as well:
- An embedding-based augmentation technique (similar in spirit to SMOTE, but better suited to text data) described in this 2015 paper by William Wang and Diyi Yang.
- A step further on the first approach, using contextualized word embeddings, described in this 2017 paper by Marzieh Fadaee, Arianna Bisazza, and Christof Monz.
- Use a synonym replacement library like WordNetAug.
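To make the synonym-replacement idea concrete, here is a self-contained toy sketch. The hard-coded `SYNONYMS` table stands in for a real WordNet lookup (a library such as nlpaug would supply that part); the function name and its interface are illustrative, not from any library:

```python
import random

# Toy synonym table standing in for a WordNet-backed lookup.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def synonym_augment(sentence, n_aug=3, seed=0):
    """Generate n_aug variants of a sentence by replacing each word
    that has known synonyms with a randomly chosen synonym. Words
    without an entry are left unchanged."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_aug):
        new = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
               for w in words]
        variants.append(" ".join(new))
    return variants
```

Even a handful of such paraphrases per rare-class sample gives the model genuinely different token sequences, unlike plain duplication.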

nlpnoah
Thank you @nlpnoah, I tried BERT synonyms and got some positive results. – Sandeep Reddy Mar 21 '20 at 05:58