Retrieve the indices for only the resampled instances after oversampling using imbalanced-learn?

Question

For a binary text classification problem with imbalanced data, I use the RandomOverSampler class from the imbalanced-learn library to balance the classes.

Now I want to retrieve only the instances that were added by oversampling (the replicas), not the original data. For example, if "item_1" is an original instance and "item_2" to "item_4" are replicas of it, I need only the indices of "item_2", "item_3" and "item_4" for further processing, and not the index of "item_1".

item_1
item_2
item_3
item_4

Here is my code:

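from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# Wrap each training instance in its own list so the input is 2-D.
X_listed = [[eachTrainInstance] for eachTrainInstance in X_train]

# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
X_tr_resampled, y_tr_resampled = ros.fit_resample(X_listed, y_train)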

score 1 · Accepted Answer · answered Aug 12 '19 at 16:17

It seems that all the oversampled instances (and, of course, their corresponding indices) are appended to the end of the original data, so everything after the first len(y_train) entries of the resampled output is a replica:

oversampled_instances = y_tr_resampled[len(y_train):]

PinkBanter
