I am fitting a logistic regression with Python's scikit-learn. I have an imbalanced dataset: 2/3 of the data points have label y=0 and 1/3 have label y=1.
I do a stratified split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, stratify=y)
My grid for the hyperparameter search is:
grid = {
'penalty': ['l1', 'l2', 'elasticnet'],
'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
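(As an aside: not every solver in this grid supports every penalty, e.g. 'lbfgs' and 'newton-cg' are L2-only, and 'elasticnet' works only with 'saga' and additionally needs an l1_ratio, so GridSearchCV will fail on some combinations. A sketch of how the grid could be grouped as a list of dicts so that each solver only sees penalties it supports, assuming a recent scikit-learn:)

```python
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# GridSearchCV accepts a list of dicts; each dict is searched separately,
# so invalid solver/penalty combinations never occur.
grid = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5], 'C': C_values},
]
```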
Then I do a grid search, including class_weight='balanced':
grid_search = GridSearchCV(
estimator=LogisticRegression(
max_iter=200,
random_state=1111111111,
class_weight='balanced',
multi_class='auto',
fit_intercept=True
),
param_grid=grid,
scoring=score,
cv=5,
refit=True
)
My first question is about the score. This is the metric GridSearchCV uses to decide which classifier is the "best", i.e. to find the best hyperparameters. Since I fit the LogisticRegression with class_weight='balanced', should I use the classic score='accuracy', or do I still need to use score='balanced_accuracy'? And why?
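(For intuition: class_weight='balanced' only changes how the model is fitted, by reweighting the loss; it does not change what any metric measures afterwards, so accuracy and balanced accuracy can still disagree. A toy sketch with made-up labels, not my real data:)

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# A classifier that always predicts the majority class 0
# on a 2:1 imbalanced test set (hypothetical labels).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))           # 4/6, looks decent
print(balanced_accuracy_score(y_true, y_pred))  # (1.0 + 0.0)/2 = 0.5, reveals the problem
```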
So I go on and find the best classifier:
best_clf = grid_search.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
And now I want to calculate evaluation metrics, for example the accuracy (again) and the F1-score.
Second question: do I need to use the "normal" accuracy/F1 here, or the balanced/weighted accuracy/F1?
"Normal":
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=1, average='binary')
Or balanced/weighted:
acc_weighted = balanced_accuracy_score(y_test, y_pred, sample_weight=y_weights)
f1_weighted = f1_score(y_test, y_pred, sample_weight=y_weights, average='weighted')
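(To make the difference concrete, here is a toy sketch, with hypothetical labels rather than my actual data, of how the average parameter changes what f1_score reports:)

```python
from sklearn.metrics import f1_score

# Hypothetical predictions on 2:1 imbalanced labels.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average='binary'))    # F1 of class 1 only: 0.5
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of both per-class F1s
print(f1_score(y_true, y_pred, average='weighted'))  # mean weighted by class support
```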
If I should be using the balanced/weighted version, my third question is about the parameter sample_weight=y_weights. How should I set the weights? To achieve balance (although, as I said, I am not sure whether setting class_weight='balanced' has already achieved it), I should weight label y=0 with 1/3 and y=1 with 2/3, right? Like this:
y_weights = [x*(1/3)+(1/3) for x in y_test]
Or should I enter the real distribution here and weight label y=0 with 2/3 and label y=1 with 1/3? Like this:
y_weights = [x*(-1/3)+(2/3) for x in y_test]
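(Rather than hand-rolling the weights, scikit-learn can compute the 'balanced' weights directly with compute_sample_weight; a sketch with hypothetical 2:1 labels showing what they look like:)

```python
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels matching the 2/3 vs 1/3 split.
y_test = [0, 0, 0, 0, 1, 1]

# 'balanced' weights: n_samples / (n_classes * count_of_that_class)
w = compute_sample_weight(class_weight='balanced', y=y_test)
print(w)  # class 0 -> 6/(2*4) = 0.75, class 1 -> 6/(2*2) = 1.5
```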
My final question is: for evaluation, what is the baseline accuracy that I should compare my accuracy against?
0.33 (always predicting class 1), 0.5 (after balancing), or 0.66 (always predicting class 0)?
Edit: By baseline I mean a model that naively classifies all data as "1", or a model that classifies all data as "0". The problem is that I don't know whether I can choose freely. For example, say I get an accuracy or a balanced_accuracy of 0.66. If I compare it with the baseline "always 1" (accuracy 0.33?), my model is better; if I compare it with the baseline "always 0" (accuracy 0.66?), my model is no better.
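(One way to make these baselines concrete is scikit-learn's DummyClassifier. A sketch with hypothetical 2:1 labels; note that under balanced accuracy both naive baselines land at 0.5, which sidesteps the "which baseline do I pick" problem:)

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

X = [[0]] * 6                   # dummy features, ignored by these baselines
y = [0, 0, 0, 0, 1, 1]          # hypothetical 2:1 imbalance

always_0 = DummyClassifier(strategy='constant', constant=0).fit(X, y)
always_1 = DummyClassifier(strategy='constant', constant=1).fit(X, y)

for clf in (always_0, always_1):
    pred = clf.predict(X)
    print(accuracy_score(y, pred), balanced_accuracy_score(y, pred))
# "always 0": accuracy 4/6, balanced accuracy 0.5
# "always 1": accuracy 2/6, balanced accuracy 0.5
```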
Thank you all very much for helping me.