
Hi everyone.

I am doing binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I oversample with SMOTE, new target values appear in the y vector (0, 1, 2, for example). I do not know how to avoid that. I would like to know how to limit the resampled target values to 0 and 1.

This is what I am doing:

from imblearn.over_sampling import SMOTE

over = SMOTE(sampling_strategy='minority')
X, y = over.fit_resample(X, y)

The y vector's dtype is int64.
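This is how I check which values end up in y after resampling:

import numpy as np

# sanity check on the resampled labels; I expect only 0 and 1,
# but extra values such as 2 show up
print(np.unique(y, return_counts=True))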

Also, I am modeling with an ANN using Keras. The dataset available to me is split into three parts: train, val, and test, for training, validation, and testing purposes. I wonder which of these should be OVERsampled (or UNDERsampled), or whether I should OVERsample (or UNDERsample) BEFORE splitting.

from sklearn.model_selection import train_test_split

# first split off a validation set, then split the remainder into train and test
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15)

WARNING: my model will be evaluated later on a much larger dataset (1000K records), which is the REAL test dataset from the company I am working for, and whose target values are UNKNOWN to me and my model.

Thanks

Johnny

  • If there are significantly more 1s than 0s in your labels, it may be an anomaly detection problem. – Duwang Aug 27 '22 at 06:21

1 Answer


Since you have 190 columns and 500K records, SMOTE will not work well: SMOTE is essentially an interpolation technique, so it may suffer from the curse of dimensionality. OVERsampling and UNDERsampling approaches also have their limitations. I would prefer to use class weights. Also, make sure that all data pre-processing is performed separately on the train data and the test data. Here is a paper outlining the framework: https://www.sciencedirect.com/science/article/pii/S2666827022000585
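For example, here is a minimal sketch of the class-weight approach, assuming your Keras model is already built and compiled as model (not shown in your post). The split comes first, so the validation and test sets keep the original class distribution and nothing synthetic leaks into them:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# split first; no resampling touches the validation/test sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, stratify=y_train)

# weight each class inversely to its frequency in the TRAINING set only
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # e.g. {0: 0.6, 1: 2.9} (illustrative)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=20,
          class_weight=class_weight)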
