
I'm performing binary text classification with two different classifiers on the same imbalanced data, and I want to compare the results of the two classifiers.

When using sklearn's logistic regression, I have the option of setting class_weight='balanced'; for sklearn's naive Bayes, no such parameter is available.
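For reference, here is a minimal sketch of what I mean by class_weight='balanced', on a small synthetic toy set (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: 90 samples of class 0, 10 of class 1 (synthetic)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(1.5, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' reweights each class by n_samples / (n_classes * class_count),
# so each minority sample counts roughly 9x as much here
clf_balanced = LogisticRegression(class_weight='balanced').fit(X, y)
clf_plain = LogisticRegression().fit(X, y)
```

The balanced version typically predicts the minority class more often than the unweighted one.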

I know that I can just randomly sample from the bigger class in order to end up with equal sizes for both classes, but then data is lost.

Why is there no such parameter for naive Bayes? I guess it has something to do with the nature of the algorithm, but I can't find anything about this specific matter. I would also like to know what the equivalent would be: how can I achieve a similar effect, so that the classifier is aware of the imbalanced data and gives more weight to the minority class and less to the majority class?

marc_s
dumbchild

2 Answers


I'm writing this partially in response to the other answer here.

Logistic regression and naive Bayes are both linear models that produce linear decision boundaries.

Logistic regression is the discriminative counterpart to naive Bayes (a generative model). You decode each model to find the best label according to p(label | data). What sets Naive Bayes apart is that it does this via Bayes' rule: p(label | data) ∝ p(data | label) * p(label).

(The other answer is right to say that the Naive Bayes features are independent of each other (given the class), by the Naive Bayes assumption. With collinear features, this can sometimes lead to bad probability estimates for Naive Bayes—though the classification is still quite good.)

The factoring here is how Naive Bayes handles class imbalance so well: it's keeping separate books for each class. There's a parameter for each (feature, label) pair. This means that the super-common class can't mess up the super-rare class, and vice versa.
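To make the "separate books" point concrete, here is a minimal NumPy sketch (toy binary features, made up for illustration) where each class's parameters are estimated only from that class's own rows:

```python
import numpy as np

# Toy binary feature matrix: 2 rows of class 0, 4 rows of class 1
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 1, 1, 1, 1])

# One p(feature | label) estimate per (feature, label) pair.
# Each class's estimates use only its own rows, so the common class
# cannot distort the rare class's parameters.
params = {}
for label in np.unique(y):
    params[label] = X[y == label].mean(axis=0)
```

The class imbalance (2 vs. 4 rows) never mixes into the per-class feature estimates; it only shows up in the p(label) prior.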

There is one place that the imbalance might seep in: the p(labels) distribution. It's going to match the empirical distribution in your training set: if it's 90% label A, then p(A) will be 0.9. If you think that the training distribution of labels isn't representative of the testing distribution, you can manually alter the p(labels) values to match your prior belief about how frequent label A or label B, etc., will be in the wild.
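In scikit-learn you can override that prior directly via the class_prior parameter of the naive Bayes estimators. A sketch with made-up count data (e.g. word counts):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: 95 documents of class 0, 5 of class 1 (synthetic)
rng = np.random.RandomState(1)
X = np.vstack([rng.poisson([3, 1, 1], size=(95, 3)),
               rng.poisson([1, 1, 3], size=(5, 3))])
y = np.array([0] * 95 + [1] * 5)

# Default: the fitted prior matches the empirical 95/5 label split
nb_default = MultinomialNB().fit(X, y)

# Override the prior to encode a belief that both labels are equally likely
nb_uniform = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, y)
```

Setting fit_prior=False achieves the same uniform-prior effect without specifying the values by hand.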

Arya McCarthy
  • Thank you for your answer. One thing I don't get is how "it's keeping separate books for each class". Could you explain that a bit further, please? – dumbchild Feb 23 '21 at 08:09
  • 1
    Good question! Naive Bayes has separate parameters for each (feature, label) pair to model p(feature | label). So if the possible labels are J, K, and L, then there are parameters for p(feature1 | J), p(feature1 | K), and p(feature1 | L). – Arya McCarthy Feb 23 '21 at 13:56
  • Thanks very much - I'd completely forgotten that Naive Bayes is a linear classifier (for some reason I thought the decision boundary was quadratic). – Cecil Cox Feb 23 '21 at 19:09
  • No worries! I know the feeling. :D – Arya McCarthy Feb 23 '21 at 19:53
  • thank you, @AryaMcCarthy, this comment really helped me – dumbchild Feb 25 '21 at 17:30

Logistic Regression is a linear model, i.e., it draws a straight line through your data, and the class of a datum is determined by which side of the line it's on. This line is just a linear combination (a weighted sum) of your features, so we can adjust for imbalanced data by adjusting the weights.

Naïve Bayes, on the other hand, works by calculating the conditional probability of labels given individual features, then uses the Naïve Bayes assumption (features are independent) to calculate the probability of a datum having a particular label (by multiplying the conditional probabilities of the features and scaling). There is no obvious parameter to adjust to account for imbalanced classes.

Instead of undersampling, you could try oversampling: expanding the smaller class with duplicates or slightly adjusted data. You could also look into other approaches based on your problem domain (since you're doing text classification, these answers have some suggested approaches).
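A minimal oversampling sketch using scikit-learn's resample utility (the arrays here are dummy data just to show the mechanics):

```python
import numpy as np
from sklearn.utils import resample

# Dummy imbalanced data: 100 majority rows, 10 minority rows
X_maj = np.zeros((100, 4))
y_maj = np.zeros(100, dtype=int)
X_min = np.ones((10, 4))
y_min = np.ones(10, dtype=int)

# Oversample the minority class with replacement until it matches
# the majority class size, then stack the two back together
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(X_maj), random_state=42)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
```

After this, both classes contribute 100 rows, and no majority-class data has been thrown away.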

Cecil Cox
  • Thank you for your answer, especially for the terms undersampling and oversampling! I didn't know there were names for it :) Now I can find even more info about it. – dumbchild Feb 25 '21 at 17:31