
I have an imbalanced dataset of 227846 observations and 30 columns. I would like to apply SMOTE from the smotefamily library and then Tomek links from the themis library to the SMOTE output, so that the dataset ends up balanced or nearly balanced. The original data is the credit card fraud dataset from Kaggle, with 284807 observations. I split it 80:20 into training and testing sets, so the 227846 observations are the training data. SMOTE ran successfully on the training set and produced a new dataset with 455184 obs. and 30 columns, with no missing data. However, when I call tomek() on the SMOTE data, it returns 0 obs. of 30 variables, and dim() returns 0 30. I have no idea what else to try. A snippet of my code is below.

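library(dplyr)  # provides %>%, select() and tibble()

# Predictors: the 28 PCA components plus the transaction amount
X <- data %>% 
  select(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, 
         V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, 
         Amount)
# Target: the fraud indicator (assumes Class is a column of data)
Y <- data$Class

library(caTools)
set.seed(123)
# Stratified 80:20 split on the target
d_sample <- sample.split(Y, SplitRatio = 0.80)
Xtrain <- subset(X, d_sample == TRUE)
Xtest <- subset(X, d_sample == FALSE)
Ytrain <- subset(Y, d_sample == TRUE)
Ytest <- subset(Y, d_sample == FALSE)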

training <- tibble(Xtrain, Ytrain)
testing <- tibble(Xtest, Ytest) 

library(smotefamily)
set.seed(123)
# Xtrain is the training dataset excluding the factor variable;
# Ytrain is a column vector of the factor variable.
# The dataset is really large, so I had to use a large dup_size to achieve balance.
smote <- SMOTE(Xtrain, Ytrain, dup_size = 577)

# SMT has no missing values
SMT <- smote$data

SMT$class <- as.factor(SMT$class)
library(themis)
set.seed(123)
SMTL <- themis::tomek(data.frame(SMT), var = "class")

I've confirmed that there are no missing values, that the class variable is a factor, and that all other variables are numeric. I also called the function as themis::tomek() and passed the data explicitly as a data frame, but I still get 0 observations. When I ran this exact code on a smaller dataset of about 800 observations, it worked perfectly and removed the Tomek-link pairs from the SMOTE output.
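
For reference, the checks described above can be sketched roughly as follows. This is only an illustration in base R: check_tomek_input is a hypothetical helper name (it is not part of themis), and the toy data frame stands in for the real SMOTE output, assuming numeric predictors plus a factor column named "class".

```r
# Hypothetical helper (not part of themis): reports whether the input
# satisfies the preconditions described above.
check_tomek_input <- function(df, var) {
  # All columns except the target variable
  predictors <- df[, setdiff(names(df), var), drop = FALSE]
  list(
    is_data_frame = is.data.frame(df),
    target_factor = is.factor(df[[var]]),
    all_numeric   = all(vapply(predictors, is.numeric, logical(1))),
    no_missing    = !anyNA(df),
    class_counts  = table(df[[var]])
  )
}

# Toy data standing in for the SMOTE output (assumed shape only).
set.seed(123)
toy <- data.frame(
  V1 = rnorm(20),
  V2 = rnorm(20),
  class = factor(rep(c("0", "1"), each = 10))
)

checks <- check_tomek_input(toy, "class")
str(checks)
```

On the real data the call would be check_tomek_input(data.frame(SMT), "class"). All of these checks pass for the full SMOTE dataset, yet tomek() still returns 0 rows.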
