Ignore columns in SMOTE oversampling

Question

I have six feature columns and one target column, which is imbalanced. Can I apply an oversampling method such as ADASYN or SMOTE so that synthetic values are created only for the four columns X1, X2, X3, X4, while the other two columns (Month, year) are copied over unchanged, as constants?

Current one:

Expected one: Synthetic records are created by up-sampling target class '1', so the number of records increases, but the added records should keep the month and year values unchanged (as shown below).

If you only want to upsample certain columns, you could split out the columns you want to upsample into a separate dataframe, then upsample them, and add back in the other columns after you've upsampled the separate dataframe. — Bobs Burgers, Jun 23 '20 at 14:16
But the unchanged columns and the upsampled dataframe will have a different number of rows, won't they? I need the same number of records for both the constant columns and the columns I want to synthesize (the synthetic records should copy the entries of my constant columns as they are). I hope this clarifies the requirement. Is there a parameter for passing such columns to ADASYN, SMOTE or any other method? — Ayyasamy, Jun 23 '20 at 14:21
In the second image above, records (3, 4, 5, 6) are created from record 2 to balance the dataset. Whatever technique is used, I don't want to change the values (month: 10, year: 2000) in those two columns, but columns X1, X2, X3 and X4 can take any values. This should balance the data without changing the values of the two columns while copying. — Ayyasamy, Jun 23 '20 at 14:30

desertnaut · Accepted Answer · 2020-06-23T15:42:00.870

From a programming perspective, an identical question asked in the relevant GitHub repo back in 2017 was answered in the negative:

[Question]

I have a data frame that I want to apply smote to but I wish to only use a subset of the columns. The other columns contain additional data for each sample and I want each new sample to contain the original info as well

[Answer]

There is no way to do that apart of extracting the column in a new matrix and process it with SMOTE. Even if you generate a new samples you have to decide what to put as values there so I don't see how such feature can be added

Answering from a modelling perspective, this is not a good idea and, even if you could find a programming workaround, you should not attempt it. Arguably, this is the reason why the developer of imbalanced-learn quoted above was dismissive of even the thought of adding such a feature to the SMOTE implementation.

Why is that? Well, synthetic oversampling algorithms, like SMOTE, essentially use some variant of a k-nn approach in order to create artificial samples "between" the existing ones. Given this approach, it goes without saying that, in order for these artificial samples to be indeed "between" the real ones (in a k-nn sense), all the existing (numerical) features must be taken into account.
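To make the "between" idea concrete, here is a minimal sketch of a SMOTE-style interpolation step in plain Python. It is not the imbalanced-learn implementation; the function name `smote_like_sample` and the sample rows are hypothetical, and the features are left unscaled purely for illustration. The point to notice is that both the neighbour search and the interpolation run over every column of the row.

```python
import math
import random

def smote_like_sample(point, minority, k=2, rng=random):
    """Create one synthetic row between `point` and one of its k nearest minority rows."""
    # Distances are computed over ALL features of the row, not a subset.
    others = [p for p in minority if p is not point]
    neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment point -> neighbour, in [0, 1)
    # Every coordinate is interpolated with the same gap.
    return [a + gap * (b - a) for a, b in zip(point, neighbour)]

# Hypothetical minority-class rows: [X1, X2, X3, X4, month, year]
minority = [
    [1.0, 2.0, 0.5, 3.0, 10.0, 2000.0],
    [1.2, 2.1, 0.4, 3.3, 11.0, 2000.0],
    [0.9, 1.8, 0.6, 2.9, 12.0, 2001.0],
]

rng = random.Random(0)
synthetic = smote_like_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

Because the same `gap` is applied to all coordinates, the synthetic row lies on the line segment joining two real rows, which is also why month and year can come out as non-integer values.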

If, by employing some programming alchemy, you manage in the end to produce new SMOTE samples based only on a subset of your features, putting the unused features back in will destroy any notion of proximity and "betweenness" of these artificial samples to the real ones. The neighbours were chosen without looking at the unused features, so the reassembled rows no longer lie between real rows in the full feature space, and this inserts a huge bias in your training set.
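The loss of proximity can be seen in a small sketch. The rows, the helper `nearest`, and the use of only two X columns are all hypothetical choices for brevity, and the features are unscaled. The nearest neighbour of a row changes depending on whether month and year take part in the distance.

```python
import math

# Hypothetical minority rows: [X1, X2, month, year]
rows = [
    [1.0, 1.0, 1.0, 2000.0],   # row A
    [1.1, 1.0, 12.0, 2010.0],  # row B: close to A in X1, X2 only
    [2.0, 2.0, 1.0, 2000.0],   # row C: same month and year as A
]

def nearest(index, columns):
    """Return the index of the row nearest to rows[index], using only `columns`."""
    def project(row):
        return [row[c] for c in columns]
    base = project(rows[index])
    candidates = [i for i in range(len(rows)) if i != index]
    return min(candidates, key=lambda i: math.dist(base, project(rows[i])))

subset_nn = nearest(0, columns=[0, 1])      # neighbour of A using X1, X2 only
full_nn = nearest(0, columns=[0, 1, 2, 3])  # neighbour of A using all features

# Midpoint of A and its subset neighbour in X1, X2, with A's month/year pasted back.
patched = [(rows[0][c] + rows[subset_nn][c]) / 2 for c in (0, 1)] + rows[0][2:]
print(subset_nn, full_nn, patched)
```

On the subset, A is paired with B, while on the full feature set its neighbour is C. The `patched` row moves halfway towards B in X1 and X2 but not at all in month and year, so it does not sit on the segment between A and B in the full space.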

In short:

If you think your Month and year are indeed useful features, just include them in SMOTE; you may get some nonsensical artificial samples, but this should not be considered a (big) problem for the purpose here.
If not, then maybe you should consider removing them altogether from your training.
