I wrote this code:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so SMOTE is applied only to the training folds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

LR = LogisticRegression()
pipe_lr = Pipeline([
    ('oversampling', SMOTE()),
    ('LR', LR)
])

C_list_lr = [0.001, 0.01, 0.1, 1, 10, 100]
solver_list_lr = ['liblinear', 'newton-cg', 'saga']
penalty_list_lr = [None, 'elasticnet', 'l1', 'l2']
max_iter_list_lr = [100, 1000, 3000]
random_state_list_lr = [None, 20, 42]

param_grid_lr = {
    'LR__C': C_list_lr,
    'LR__solver': solver_list_lr,
    'LR__penalty': penalty_list_lr,
    'LR__max_iter': max_iter_list_lr,
    'LR__random_state': random_state_list_lr
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='accuracy', return_train_score=False)
grid_lr.fit(x1_train, y1_train)
I have two questions:
- Is the code correct?
- Is it normal to get a lower accuracy score this way than by simply using LogisticRegression with parameters I chose myself and without oversampling?
I am working with a dataset of 4024 samples. It is a binary classification problem with ~3400 examples in one class and only 624 in the other. When I ran the same algorithm on the dataset without any over-/under-sampling, I got an accuracy of 0.89, but after oversampling and GridSearchCV only 0.83.
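For context on the scores above: with ~3400 of 4024 samples in one class, a trivial classifier that always predicts the majority class already reaches roughly 0.845 accuracy, so plain accuracy may not tell the whole story here. A minimal sketch using the class counts from the question:

```python
# Class counts taken from the question above
majority = 3400
minority = 624
total = majority + minority  # 4024 samples

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = majority / total
print(round(baseline_accuracy, 3))  # ~0.845
```

This baseline sits between the two reported scores (0.83 and 0.89), which is one reason metrics such as balanced accuracy or F1 are often preferred for imbalanced problems.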