I have a data set (tf-idf weighted words) with multiple classes that I'm trying to predict. My classes are imbalanced. I would like to use the one-vs-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes), using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of its combined over- and undersampling methods) to enhance my data. The usual way of applying imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
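# resample the whole multiclass data set at once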
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I would then fit the classifier on the resampled data:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
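# fits one binary (one label vs. the rest) classifier per label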
ovr.fit(X_resampled, y_resampled)
But now there is a huge imbalance within every binary problem once the classifier is fitted, right? OneVsRestClassifier fits one binary classifier per label, and since I have more than 50 labels in total, even after the global resampling each one-vs-rest split has on the order of 50 negatives for every positive. I imagine that I need to apply the over-/undersampling to every label separately instead of doing it once at the beginning. How can I apply the resampling per label?
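Something like the following is what I have in mind, sketched under my assumption that imblearn's Pipeline (which applies the sampler only during fit) can be passed to OneVsRestClassifier, so that each cloned pipeline resamples its own binary problem:

from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# assumption: the sampler runs only at fit time, so every per-label
# clone of the pipeline resamples its own one-vs-rest split
pipe = Pipeline([
    ('sampler', SMOTEENN(random_state=0)),
    ('clf', MultinomialNB()),
])
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)  # fit on the original, unresampled data

Is this the right way to do it, or is there a built-in mechanism I am missing?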