I'm solving a classification problem with sklearn's logistic regression in python.

My problem is a general/generic one. I have a dataset with two classes (positive/negative, i.e. 1/0), but the set is highly unbalanced: ~5% positives and ~95% negatives.

I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.

What I've done thus far is to build a balanced training set by selecting all entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to modify the model so that it works on the original unbalanced population/set.
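A minimal sketch of the balancing step described above, on toy data standing in for my actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for my data: 1000 rows, 5% positives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1  # 50 positives, 950 negatives

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# draw as many negatives as there are positives
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)

balanced_idx = np.concatenate([pos_idx, neg_sample])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]  # 50/50 split
```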

What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.

agentscully

2 Answers

Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
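For instance, LogisticRegression supports it (in recent sklearn versions the keyword value is "balanced"; "auto" is the older spelling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# toy unbalanced data: ~5% positives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# class_weight="balanced" re-weights each sample inversely to its
# class frequency inside the loss, so no resampling is needed
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)

# the fitted model applies directly to the original unbalanced data
probs = clf.predict_proba(X)[:, 1]
```

Because the re-weighting happens inside the objective, there is no separate "adjust the model back" step afterwards.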

Also, you can rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise).
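A sketch of the over-sampling variant: duplicate the positive rows with replacement until the classes are even, jittering the copies with a little Gaussian noise (the noise scale 0.05 here is arbitrary and should be tuned to your feature scales):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy unbalanced data: 50 positives, 950 negatives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

pos = X[y == 1]
n_extra = int((y == 0).sum() - (y == 1).sum())  # how many copies we need

# duplicate positives with replacement and jitter them slightly
dup = pos[rng.integers(len(pos), size=n_extra)]
dup = dup + rng.normal(scale=0.05, size=dup.shape)

X_over = np.vstack([X, dup])
y_over = np.concatenate([y, np.ones(n_extra, dtype=int)])  # now 50/50
```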

ogrisel
  • Yes, class_weight='auto' works great. Is there any advantage to not using the built-in/black-box auto weighting and instead rebalancing the training set (as I originally did)? Regardless, if I took the approach of balancing the training set, how do I adjust the fitted model to apply to an unbalanced test set? – agentscully Feb 23 '13 at 05:17
  • It's not that black-box: it just re-weights the samples in the empirical objective function being optimized by the algorithm. Under-sampling over-represented classes is good because training is faster :) but you are dropping data, which is bad, especially if your model is already in an overfitting regime (significant gap between train and test scores). Over-sampling is generally mathematically equivalent to re-weighting but slower because of duplicated operations. – ogrisel Feb 23 '13 at 14:42
@agentscully Have you read the following paper: [SMOTE](https://www.jair.org/media/953/live-953-2037-jair.pdf)? I found it very informative. Here is the link to the Repo. Depending on how you go about balancing your target classes, you can use either:

  • 'auto' (deprecated since version 0.17) or 'balanced', or specify the class ratio yourself, e.g. {0: 0.1, 1: 0.9}.
  • 'balanced': this mode adjusts the weights inversely proportional to class frequencies, as n_samples / (n_classes * np.bincount(y)).
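To see what the 'balanced' mode computes, the formula above can be evaluated by hand on a toy label vector matching the 95/5 split from the question:

```python
import numpy as np

y = np.array([0] * 95 + [1] * 5)   # 95% negatives, 5% positives
n_classes = 2

# n_samples / (n_classes * np.bincount(y))
weights = len(y) / (n_classes * np.bincount(y))
# majority class gets a small weight (~0.526), minority a large one (10.0)
```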

Let me know if more insight is needed.

Pramit