Using SMOTE for the first time in R

I am using SMOTE on training data where the majority class (0) has 7952346 rows and the minority class (1) has 27230 rows. I want to downsample so that I end up with close to 30000 1's and somewhere between 180000 and 200000 0's.

I have not been able to get this to work. I tried different parameter values but never got the class counts I wanted. Can someone help me with this?

table(train$ModelLabel)

      0       1 
7952346   27230 

train2 <- SMOTE(ModelLabel ~ .,train, perc.over=100,perc.under = 600)
table(train2$ModelLabel)

     0      1 
163380  54460 

train2 <- SMOTE(ModelLabel ~ .,train, perc.over=5,perc.under = 600)
table(train2$ModelLabel)

    0     1 
 8166 28591 

train2 <- SMOTE(ModelLabel ~ .,train, perc.over=5,perc.under = 10)
table(train2$ModelLabel)

    0     1 
  136 28591 

train2 <- SMOTE(ModelLabel ~ .,train, perc.over=25,perc.under = 0)
table(train2$ModelLabel)

    0     1 
    0 34037 

train2 <- SMOTE(ModelLabel ~ .,train, perc.over=25,perc.under = 400)
table(train2$ModelLabel)

    0     1 
27228 34037 

1 Answer

If you look at the code of SMOTE:

SMOTE
function (form, data, perc.over = 200, k = 5, perc.under = 200, 
    learner = NULL, ...) 
{
    [....]
    newExs <- smote.exs(data[minExs, ], ncol(data), perc.over, 
        k)
    if (tgt < ncol(data)) {
        newExs <- newExs[, cols]
        data <- data[, cols]
    }
    selMaj <- sample((1:NROW(data))[-minExs], as.integer((perc.under/100) * 
        nrow(newExs)), replace = T)
   [...]

    newdataset <- rbind(data[selMaj, ], data[minExs, ], newExs)

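So it's a pretty weird calculation: the number of majority rows is not taken from the original class sizes. It is perc.under/100 times the number of synthetic minority rows (newExs), which in turn depends on perc.over. That is why changing perc.over also changes how many 0's you get. The bottom line is that if you would like to use this package, you can oversample with SMOTE only and then draw the rows for both classes yourself, for example: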

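library(DMwR)

# simulated data with the same class sizes as in the question
train = data.frame(matrix(rnorm((7952346+27230)*10),ncol=10))
train$ModelLabel = factor(rep(c(0,1),times=c(7952346,27230)))

# perc.under = 0 keeps no majority rows; perc.over = 120 adds one synthetic row per minority row
oversample = SMOTE(ModelLabel ~ .,data=train, perc.over=120,perc.under = 0)
table(oversample$ModelLabel)

    0     1 
    0 54460

# draw 30000 minority rows (real + synthetic) and 180000 majority rows
newdata = rbind(
oversample[sample(nrow(oversample),30000),],
train[sample(which(train$ModelLabel==0),180000,replace=TRUE),]
)
table(newdata$ModelLabel)

     0      1 
180000  30000 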
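
If you would rather stay inside a single SMOTE() call, the code excerpt above together with the counts in the question suggest the arithmetic below. This is only a sketch: smote_counts is a hypothetical helper, not part of DMwR, and the branch for perc.over below 100 is inferred from the question's outputs (perc.over=5 added 1361 rows, which is 5% of 27230), so treat it as an assumption about how smote.exs behaves.

```r
# Hypothetical helper (not part of DMwR): predicts the class counts that
# SMOTE() would return for a given minority size and parameter pair.
smote_counts <- function(n_min, perc.over, perc.under) {
  if (perc.over < 100) {
    # Assumption: below 100, only this fraction of the minority rows
    # gets one synthetic neighbour each.
    n_new <- as.integer((perc.over / 100) * n_min)
  } else {
    # Assumption: at 100 or above, every minority row gets
    # as.integer(perc.over / 100) synthetic neighbours.
    n_new <- as.integer(perc.over / 100) * n_min
  }
  # From the excerpt: majority rows = perc.under/100 * number of synthetic rows.
  n_maj <- as.integer((perc.under / 100) * n_new)
  c(majority = n_maj, minority = n_min + n_new)
}

# Reproduces the counts reported in the question
smote_counts(27230, perc.over = 100, perc.under = 600)   # 163380  54460
smote_counts(27230, perc.over = 5,   perc.under = 600)   #   8166  28591
smote_counts(27230, perc.over = 25,  perc.under = 400)   #  27228  34037

# A parameter pair that would land in the requested range
smote_counts(27230, perc.over = 10,  perc.under = 7000)  # 190610  29953
```

By this arithmetic, SMOTE(ModelLabel ~ ., train, perc.over=10, perc.under=7000) should give roughly 190610 0's and 29953 1's. I have not run it on the full data, so check the result with table() afterwards.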