
I have a class imbalance problem with the following dataset:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

and similar. I have 5000 rows/texts (4500 with class 0 and 500 with class 1).

I need to re-sample my classes, but I am not sure where this step belongs in my pipeline, so I would appreciate it if you could have a look and tell me whether I am missing a step or whether you spot any inconsistencies in the approach.

For the train/test split I am using the following:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

where X and y are:

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)


# Separating classes

spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]

# Oversampling (resample is sklearn.utils.resample)

oversampl = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)

oversampled = pd.concat([not_spam, oversampl])
df_train = oversampled.copy()
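As a sanity check, here is a minimal self-contained sketch of this oversampling step with a toy DataFrame standing in for `df_train` (the toy data and sizes are made up for illustration; `logR_pipeline` is the pipeline name assumed from the comments below):

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for df_train: 9 majority-class rows, 3 minority-class rows
df_train = pd.DataFrame({
    'Text': [f'text {i}' for i in range(12)],
    'Label': [0] * 9 + [1] * 3,
})

spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]

# Upsample the minority class (with replacement) to the majority-class size
oversampl = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)
oversampled = pd.concat([not_spam, oversampl])

# Both classes now have 9 rows; this balanced frame is what the model
# must be fit on, e.g. logR_pipeline.fit(oversampled['Text'], oversampled['Label'])
print(oversampled.Label.value_counts())
```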

Output (wrong?):

              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96      3600
         1.0       0.76      0.52      0.62       400

    accuracy                           0.93      4000
   macro avg       0.86      0.77      0.80      4000
weighted avg       0.92      0.93      0.93      4000

Do you think there is something wrong in my oversampling steps, given that the classification report shows a support of 400 for class 1 and not higher?

Sorry for the long post, but I think it is worth reporting all the steps for a better understanding of the approach I have taken.

LdM
  • It looks like you didn't use the `oversampled` variable to train your model. I think this line `logR_pipeline.fit(df_train['Text'], df_train['Label'])` should be `logR_pipeline.fit(oversampled['Text'], oversampled['Label'])`. – ygorg Feb 17 '21 at 15:17
  • I am having trouble understanding what is the question you want an answer for. Do you seek advice as how to use oversampling ? Or advice on how to train your model ? How familiar are you with machine learning ? – ygorg Feb 17 '21 at 15:18
  • Please make an executable example. I feel like key part of your code are missing, `build_confusion_matrix` is not defined, the `c` parameter is unused. – ygorg Feb 17 '21 at 15:50

1 Answer


There is nothing wrong with your method and it's normal that the evaluation report shows imbalanced data. This is because:

  • The resampling is (rightly) done on the training set only, in order to force the model to give more importance to the minority class.
  • The evaluation is (rightly) made on the test set which follows the original imbalanced distribution. It would be a mistake to resample the test set as well, because the evaluation must be done on the true distribution of the data.
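To illustrate, here is a minimal sketch with synthetic data mirroring the 4500/500 split described in the question (the `feat` column is just a placeholder feature): the training set is resampled, while the test set keeps the original imbalanced distribution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with the 4500/500 imbalance from the question
df = pd.DataFrame({
    'feat': range(5000),
    'Label': [0] * 4500 + [1] * 500,
})
X = df[['feat']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40)

df_train = pd.concat([X_train, y_train], axis=1)

# Oversample the minority class in the training set only
spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]
oversampl = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)
balanced_train = pd.concat([not_spam, oversampl])

print(balanced_train.Label.value_counts())  # balanced classes for training
print(y_test.value_counts())                # test set stays imbalanced (roughly 9:1)
```

The model is trained on `balanced_train`, but the classification report is computed on `y_test`, so its support reflects the true class proportions.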
Erwan