For a binary text classification problem with imbalanced data, I use the RandomOverSampler class from the imbalanced-learn library
to balance the classes.
Now I want to retrieve only the instances that were oversampled (replicated) from the original data. For example, if "item_1" is the original instance and "item_2" through "item_4" are replicas of "item_1", I need only the indices of "item_2", "item_3", and "item_4" for further processing, and want to leave out the index of "item_1".
- item_1
- item_2
- item_3
- item_4
Here is my code:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_listed = []
for eachTrainInstance in X_train:
    X_listed.append([eachTrainInstance])
# fit_resample is the current method name; older imbalanced-learn releases called it fit_sample
X_tr_resampled, y_tr_resampled = ros.fit_resample(X_listed, y_train)
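
To make the goal concrete, here is a minimal library-free sketch of the bookkeeping I have in mind. It is not imbalanced-learn's implementation: the helper `random_oversample_with_indices` and the toy `X_train` / `y_train` values are hypothetical and exist only for illustration. The sketch assumes the original rows are kept first and the replicas are appended after them. As far as I know, recent imbalanced-learn releases expose a `sample_indices_` attribute on a fitted RandomOverSampler that plays the same role as `sample_indices` below, but that, and the ordering of the resampled rows, should be checked against the installed version.

```python
import random
from collections import Counter


def random_oversample_with_indices(X, y, seed=42):
    """Hypothetical helper: oversample minority classes by random replication.

    Returns the resampled data plus, for every resampled row, the index of
    the original row it came from. Originals come first, replicas after.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class

    # Start with every original row, in its original order.
    sample_indices = list(range(len(X)))
    for label, count in counts.items():
        candidates = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement until this class reaches the majority size
        # (draws nothing for the majority class itself).
        sample_indices.extend(rng.choices(candidates, k=target - count))

    X_res = [X[i] for i in sample_indices]
    y_res = [y[i] for i in sample_indices]
    return X_res, y_res, sample_indices


# Toy data (assumed for illustration): four majority rows, one minority row.
X_train = ["doc a", "doc b", "doc c", "doc d", "doc e"]
y_train = [0, 0, 0, 0, 1]

X_res, y_res, sample_indices = random_oversample_with_indices(X_train, y_train)

n_original = len(X_train)
# Positions of the replicas inside the resampled data.
replica_positions = list(range(n_original, len(X_res)))
# For each replica, the index of the original row it was copied from.
replica_sources = sample_indices[n_original:]
```

With this layout, `replica_positions` corresponds to the indices of "item_2" to "item_4" in my example, and `replica_sources` tells which original row (the "item_1") each of them duplicates.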