themis::tomek() returning 0 observations

Question

I have an imbalanced dataset of 227846 observations and 30 columns, and I would like to apply SMOTE from the smotefamily library and then tomek() from the themis library to the SMOTE output to make the dataset balanced or nearly balanced. The original data is the credit card fraud dataset from Kaggle, with 284807 observations, which I split 80:20 into training and testing sets, so the 227846 observations are the training data. SMOTE ran successfully on the training data and produced a new dataset with 455184 observations and 30 columns, with no missing data. However, when I try to use tomek() on the SMOTE data, it returns 0 obs. of 30 variables, and dim() returns 0 30. I have no idea what else to try. A snippet of my code is below.

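library(dplyr)

# predictors: V1..V28 plus Amount
X <- data %>% 
  select(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, 
         V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, 
         Amount)
# outcome: the Class column of the same data frame
Y <- data$Class

library(caTools)
set.seed(123)
# stratified 80:20 split on the outcome
d_sample <- sample.split(Y, SplitRatio = 0.80)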
Xtrain <- subset(X, d_sample==TRUE)
Xtest <- subset(X, d_sample==FALSE)
Ytrain <- subset(Y, d_sample==TRUE)
Ytest <- subset(Y, d_sample==FALSE)

training <- tibble(Xtrain, Ytrain)
testing <- tibble(Xtest, Ytest) 

library(smotefamily)
set.seed(123)
smote <- SMOTE(Xtrain, Ytrain, dup_size = 577)
# Xtrain is the training dataset excluding the factor variable; Ytrain is a vector of the factor variable
# the dataset is really large, so I had to use a large dup_size value to achieve balance

SMT <- smote$data
#SMT has no missing values

SMT$class <- as.factor(SMT$class)
library(themis)
set.seed(123)
SMTL <- themis::tomek(data.frame(SMT), var = "class")

I've made sure that there are no missing values, that the factor variable is indeed a factor, and that all the other variables are numeric. I also called the function as themis::tomek() and passed my dataset as a data frame, but I'm still getting 0 observations. When I ran this exact same code on a smaller dataset of about 800 observations, it worked perfectly and removed the pairs of Tomek links from the SMOTE dataset without any problem.
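For reference, the checks described above can be sketched roughly as follows. This runs on a small made-up data frame, not the real SMOTE output; the toy data, its column names and the `check_inputs` helper name are assumptions for illustration only, and the code uses base R alone.

```r
# Hypothetical toy stand-in for the SMOTE output (SMT); the values are made up.
set.seed(123)
toy <- data.frame(
  V1 = rnorm(20),
  V2 = rnorm(20),
  class = rep(c("0", "1"), each = 10)
)
toy$class <- as.factor(toy$class)

# Hypothetical helper that collects the sanity checks in one place.
check_inputs <- function(df, var) {
  # every column except the outcome is treated as a predictor
  predictors <- df[, setdiff(names(df), var), drop = FALSE]
  list(
    any_missing       = any(is.na(df)),            # TRUE if any cell is NA
    outcome_is_factor = is.factor(df[[var]]),      # outcome must be a factor
    all_numeric       = all(vapply(predictors, is.numeric, logical(1))),
    class_counts      = table(df[[var]]),          # balance of the outcome
    dims              = dim(df)
  )
}

checks <- check_inputs(toy, "class")
str(checks)
```

On the real data the same call would be `check_inputs(data.frame(SMT), "class")`, which makes it easy to confirm the input looks as expected immediately before tomek() is called.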

Please provide enough code so others can better understand or reproduce the problem. — Community, Mar 03 '23 at 08:52
