I am trying to classify 10,000 text samples into 20 classes. Four of the classes have just one sample each. I tried SMOTE to address this imbalance, but it cannot generate new samples for the classes that have only one record, although it works for classes with more than one sample. Any suggestions?
A good explainer on SMOTE (and a potential answer to why it fails on your single-sample classes) can be found in this answer.
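The short version of why it fails: SMOTE creates a synthetic point by interpolating between an existing minority-class sample and one of its same-class nearest neighbours. A class with a single sample has no neighbour to interpolate toward. A minimal sketch of that core step (plain Python, not the imbalanced-learn implementation):

```python
import random

def smote_like_sample(class_points, rng=None):
    """Core SMOTE step: interpolate between a point and a same-class
    neighbour. With only one point there is no neighbour, so no new
    sample can be generated -- hence the failure on 1-sample classes."""
    if len(class_points) < 2:
        raise ValueError("SMOTE needs at least 2 samples in the class")
    rng = rng or random.Random()
    a, b = rng.sample(class_points, 2)
    lam = rng.random()  # random position along the segment a -> b
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]
```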
I think this issue can't be solved easily through off-the-shelf data augmentation strategies. One possibility might be to simply duplicate the example, but this would add no new information to your model.
Here are a few other strategies you could try as well:
- An embedding-based augmentation technique (similar in spirit to SMOTE, but better suited to text data) described in this 2015 paper by William Wang and Diyi Yang.
- A step further on the first approach, using contextualized word embeddings, described in this 2017 paper by Marzieh Fadaee, Arianna Bisazza, and Christof Monz.
- Use a synonym replacement library like WordNetAug.
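To make the synonym-replacement idea concrete, here is a self-contained toy sketch. The hard-coded `SYNONYMS` table stands in for a real WordNet lookup (a library such as nlpaug would supply that part); the function name and its interface are illustrative, not from any library:

```python
import random

# Toy synonym table standing in for a WordNet-backed lookup.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def synonym_augment(sentence, n_aug=3, seed=0):
    """Generate n_aug variants of a sentence by replacing each word
    that has known synonyms with a randomly chosen synonym. Words
    without an entry are left unchanged."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_aug):
        new = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
               for w in words]
        variants.append(" ".join(new))
    return variants
```

Even a handful of such paraphrases per rare-class sample gives the model genuinely different token sequences, unlike plain duplication.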

nlpnoah
Thank you @nlpnoah, I tried BERT synonyms and got some positive results. – Sandeep Reddy Mar 21 '20 at 05:58