
When using sklearn's LogisticRegression for binary classification on an imbalanced training dataset (e.g., 85% positive class vs. 15% negative class), is there a difference between setting the class_weight argument to 'balanced' and setting it to {0:0.15, 1:0.85}? Based on the documentation, it appears to me that using the 'balanced' argument does the same thing as providing the dictionary.

class_weight

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
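For reference, you can inspect the weights that 'balanced' would produce with sklearn's compute_class_weight utility. A minimal sketch, using a synthetic label array with the 85/15 split described above (the array is illustrative, not from the original post):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels matching the 85/15 split: class 1 is the majority
y = np.array([1] * 850 + [0] * 150)

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# {0: 3.33, 1: 0.59} (approx.): n_samples / (n_classes * np.bincount(y)),
# so the minority class gets the larger weight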

thereandhere1

2 Answers


Yes, it means the same thing. With the class_weight='balanced' parameter you don't need to pass the exact numbers; the weights are computed for you automatically.

You can see a more extensive explanation in this link:

https://scikit-learn.org/dev/glossary.html#term-class-weight

To confirm the similarity of the following settings:

  • class_weight = 'balanced'
  • class_weight = {0:0.5, 1:0.5}
  • class_weight = None

I ran this experiment:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Note: iris has three classes; the custom dict below only sets weights for
# classes 0 and 1, so class 2 keeps the default weight of 1
X, y = load_iris(return_X_y=True)
clf_balanced = LogisticRegression(class_weight='balanced', random_state=0).fit(X, y)
clf_custom = LogisticRegression(class_weight={0: 0.5, 1: 0.5}, random_state=0).fit(X, y)
clf_none = LogisticRegression(class_weight=None, random_state=0).fit(X, y)

# Compare training accuracy under the three settings
print('Balanced:', clf_balanced.score(X, y))
print('Custom:  ', clf_custom.score(X, y))
print('None:    ', clf_none.score(X, y))

And the output is:

Balanced: 0.9733333333333334
Custom:   0.9733333333333334
None:     0.9733333333333334

So, we can conclude empirically that they are the same.

IMB
  • Does class_weight='balanced' adjust the number of samples in the smaller class, or adjust the learned weights? If the training dataset is balanced, does it mean that the following three configurations are the same: 1) class_weight='balanced' 2) {0:0.5, 1:0.5} 3) class_weight=None? – thereandhere1 Jun 16 '20 at 01:48
  • As I understood from the link I just added to my answer, it adjusts the loss function by weighting the loss of each sample by its class weight (see the sketch after these comments). – IMB Jun 16 '20 at 11:04
  • @IMB Thank you, this makes sense. Are you also able to verify my second question: if the training dataset is balanced, does it mean that the following three configurations are the same: 1) class_weight='balanced' 2) {0:0.5, 1:0.5} 3) class_weight=None? – thereandhere1 Jun 16 '20 at 14:17
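To illustrate the mechanism described in the comments, here is a minimal sketch of a class-weighted binary log-loss. The function name and the simplification (no regularization term) are my own, not sklearn's internals:

import numpy as np

# Hypothetical sketch: each sample's log-loss is multiplied by the weight
# of its class, which is how class_weight enters the objective
def weighted_log_loss(y_true, p_pred, class_weight):
    sample_w = np.where(y_true == 1, class_weight[1], class_weight[0])
    losses = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return np.sum(sample_w * losses)

# With class_weight={0: 1.0, 1: 1.0} this reduces to the ordinary log-loss
y_true = np.array([0, 1, 1])
p_pred = np.array([0.2, 0.7, 0.9])
print(weighted_log_loss(y_true, p_pred, {0: 3.33, 1: 0.59}))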

The answer accepted in this thread is incorrect. When 'balanced' is given as the argument, sklearn computes each class's weight as: weight of class = total samples / (number of classes * number of samples in that class).

The example illustrated uses a balanced dataset (iris has 50 samples in each of its 3 classes). If you pass 'balanced', the formula above gives every class a weight of 150 / (3 * 50) = 1, whereas the manually passed weights are 0.5 per class.
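This is easy to verify with the utility sklearn uses internally; a short check, assuming the iris labels from the experiment above:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.class_weight import compute_class_weight

_, y = load_iris(return_X_y=True)
print(compute_class_weight('balanced', classes=np.unique(y), y=y))
# [1. 1. 1.]  i.e. 150 samples / (3 classes * 50 samples per class)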

The results of the model don't change here because scaling every class weight by the same constant only rescales the data-fit term of the loss, which is equivalent to rescaling the regularization parameter C by that same constant; on this dataset that makes no practical difference.
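A quick way to check that equivalence; a minimal sketch of my own, assuming uniform 0.5 weights for all three iris classes and that the lbfgs solver converges to the same optimum in both parameterizations:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Uniform 0.5 weights rescale the data-fit term, which is equivalent to
# halving C while keeping unit weights
clf_w = LogisticRegression(class_weight={0: 0.5, 1: 0.5, 2: 0.5},
                           C=1.0, max_iter=1000).fit(X, y)
clf_c = LogisticRegression(C=0.5, max_iter=1000).fit(X, y)

print(np.allclose(clf_w.coef_, clf_c.coef_, atol=1e-3))  # True, up to solver tolerance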