
I am building a logistic regression classifier with Python scikit-learn. I have an imbalanced dataset: 2/3 of the data points have label y=0 and 1/3 have label y=1.

I do a stratified split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, stratify=y)
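
As a quick sanity check (assuming y is an integer array of 0/1 labels), I verify that both splits keep roughly the 2/3 vs. 1/3 ratio:

import numpy as np

# stratify=y should preserve ~2/3 label 0 and ~1/3 label 1 in both splits
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))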

My grid for the hyperparameter search is:

grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
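
If I understand the scikit-learn docs correctly, not every solver supports every penalty ('newton-cg', 'lbfgs' and 'sag' only handle 'l2', and 'elasticnet' only works with 'saga' plus an l1_ratio), so the same search could also be written as a list of dicts with only the valid combinations, roughly like this:

# Sketch: each solver only paired with penalties it supports
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
grid = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5], 'C': C_values},  # l1_ratio value is just an example
]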

Then I do a grid search including class_weight='balanced':

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=LogisticRegression(
        max_iter=200,
        random_state=1111111111,
        class_weight='balanced',
        multi_class='auto',
        fit_intercept=True
    ),
    param_grid=grid,
    scoring=score,
    cv=5,
    refit=True
)
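
Here score is just a string with one of scikit-learn's built-in scorer names; these are the two candidates I am deciding between:

# The two scorer names I am considering for GridSearchCV
score = 'accuracy'
# score = 'balanced_accuracy'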

My first question is about the scoring, i.e. the criterion GridSearchCV uses to pick the "best" classifier and thus the best hyperparameters. Since I fit the LogisticRegression with class_weight='balanced', can I use the classic score='accuracy', or do I still need to use score='balanced_accuracy'? And why?

So I go on and find the best classifier:

best_clf = grid_search.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)

Now I want to calculate evaluation metrics on the test set, for example the accuracy (again) and the F1 score.

Second question: Do I need to use the "normal" accuracy/F1 here, or the balanced/weighted accuracy/F1?

"Normal":

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=1, average='binary')

Or balanced/weighted:

acc_weighted = balanced_accuracy_score(y_test, y_pred, sample_weight=y_weights)
f1_weighted = f1_score(y_test, y_pred, sample_weight=y_weights, average='weighted')
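
For reference, I also print the per-class breakdown, which shows the per-class numbers next to the macro and weighted averages (a sketch, assuming binary 0/1 labels):

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 plus macro and weighted averages in one table
print(classification_report(y_test, y_pred, target_names=['y=0', 'y=1']))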

If I should be using the balanced/weighted version, my third question concerns the parameter sample_weight=y_weights. How should I set the weights? To achieve balance (although, as I said, I am not sure whether setting class_weight='balanced' has already achieved that), I should weight label y=0 with 1/3 and label y=1 with 2/3, right? Like this:

y_weights = [x*(1/3)+(1/3) for x in y_test]

Or should I use the real distribution here and weight label y=0 with 2/3 and label y=1 with 1/3? Like this:

y_weights = [x*(-1/3)+(2/3) for x in y_test]
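
Or should I not hand-code the weights at all and let scikit-learn derive them, e.g. with compute_sample_weight (a sketch, if I understand that helper correctly)?

from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each sample by n_samples / (n_classes * count_of_its_class),
# so the minority class y=1 gets the larger weight
y_weights = compute_sample_weight(class_weight='balanced', y=y_test)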

My final question is: for the evaluation, what is the baseline accuracy that I should compare my accuracy against?

0.33 (class 1), 0.5 (after balancing), or 0.66 (class 0)?

Edit: By baseline I mean a model that naively classifies all data as "1", or a model that classifies all data as "0". The problem is that I don't know whether I can choose freely. For example, say I get an accuracy or a balanced_accuracy of 0.66: compared with the baseline "always 1" (accuracy 0.33?), my model is better, but compared with the baseline "always 0" (accuracy 0.66?), it is not.
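
To make these baselines concrete, this is roughly what I mean (a sketch using DummyClassifier):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Baseline "always 0" (the majority class here) and baseline "always 1"
baselines = {
    'always 0': DummyClassifier(strategy='most_frequent'),
    'always 1': DummyClassifier(strategy='constant', constant=1),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(name, accuracy_score(y_test, pred), balanced_accuracy_score(y_test, pred))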

Thank you all very much for helping me.

LBoss
  • Why don't you balance your data before training? Working with unbalanced data may give bad results. Use the `class_weight` argument to balance your data, i.e. provide weights for each class, as a dict, which will ensure balance for the training set. – Catalina Chircu Mar 19 '20 at 21:38
  • Thanks @CatalinaChircu , so if I used `class_weight='balanced'`, do I still need to set `sample_weight=[x*(1/3)+(1/3) for x in y_true]` in f1_score()? Or not? And why? – LBoss Mar 30 '20 at 14:25
  • If you set `class_weight='balanced'` you do not need to use `sample_weight`, because class weighting is another way of balancing data, and has the same impact as resampling your training set. – Catalina Chircu Mar 30 '20 at 14:30
