I have a class imbalance problem with the following dataset:
Text                               is_it_capital?   is_it_upper?   contains_num?   Label
an example of text                 0                0              0               0
ANOTHER example of text            1                1              0               1
What's happening?Let's talk at 5   1                0              1               1
and similar. I have 5000 rows/texts (4500 with class 0 and 500 with class 1).
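(For reference, a quick check of the class distribution, where df is the DataFrame holding the table above:)

print(df['Label'].value_counts())
# 0    4500
# 1     500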
I need to re-sample my classes, but I do not know where to include this step in my workflow. I would appreciate it if you could have a look and tell me whether I am missing a step or if you spot any inconsistencies in my approach.
For the train/test split I am using the following:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
where X and y are:
X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']
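Just to give the sizes involved, with test_size=0.25 on 5000 rows the split should come out as:

print(X_train.shape, X_test.shape)   # expected: (3750, 4) (1250, 4)
print(y_train.shape, y_test.shape)   # expected: (3750,) (1250,)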
from sklearn.utils import resample

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# Separating the classes
spam = df_train[df_train.Label == 1]
not_spam = df_train[df_train.Label == 0]

# Oversampling the minority class (with replacement) up to the size of the majority class
oversampl = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)
oversampled = pd.concat([not_spam, oversampl])
df_train = oversampled.copy()
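As a quick sanity check on the oversampling (a sketch, assuming the code above has run), the class counts in df_train should now be balanced:

print(df_train['Label'].value_counts())
# both labels should now have len(not_spam) rows each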
Output (wrong?):

              precision    recall  f1-score   support

         0.0       0.94      0.98      0.96      3600
         1.0       0.76      0.52      0.62       400

    accuracy                           0.93      4000
   macro avg       0.86      0.77      0.80      4000
weighted avg       0.92      0.93      0.93      4000
Do you think there is something wrong in my steps for oversampling the dataset, given that the classification report shows a support of 400 and not higher?
Sorry for the long post, but I think it is worth reporting all the steps for a better understanding of the approach I have taken.