I wrote this code:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so SMOTE is applied only to the training folds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

LR = LogisticRegression()
pipe_lr = Pipeline([
    ('oversampling', SMOTE()),
    ('LR', LR)
])

C_list_lr = [0.001, 0.01, 0.1, 1, 10, 100]
solver_list_lr = ['liblinear', 'newton-cg', 'saga']
penalty_list_lr = [None, 'elasticnet', 'l1', 'l2']
max_iter_list_lr = [100, 1000, 3000]
random_state_list_lr = [None, 20, 42]

param_grid_lr = {
    'LR__C': C_list_lr,
    'LR__solver': solver_list_lr,
    'LR__penalty': penalty_list_lr,
    'LR__max_iter': max_iter_list_lr,
    'LR__random_state': random_state_list_lr
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='accuracy', return_train_score=False)
grid_lr.fit(x1_train, y1_train)
I have two questions:
- Is the code correct?
- Is it normal to get a lower accuracy score this way than by simply using LogisticRegression with parameters I chose myself and without oversampling?
I am working with a dataset of 4024 samples. It is a binary classification problem with ~3400 examples in one class and only 624 in the other. When I ran the same algorithm on the dataset without any over-/under-sampling, I got an accuracy of 0.89, but after oversampling and GridSearchCV only 0.83.
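For context on the scores above: with ~3400 of 4024 samples in one class, a trivial classifier that always predicts the majority class already reaches roughly 0.845 accuracy, so plain accuracy may not tell the whole story here. A minimal sketch using the class counts from the question:

```python
# Class counts taken from the question above
majority = 3400
minority = 624
total = majority + minority  # 4024 samples

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = majority / total
print(round(baseline_accuracy, 3))  # ~0.845
```

This baseline sits between the two reported scores (0.83 and 0.89), which is one reason metrics such as balanced accuracy or F1 are often preferred for imbalanced problems.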