I'm solving a classification problem with sklearn's logistic regression in python.

My problem is a general/generic one. I have a dataset with two classes (positive/negative, i.e. 1/0), but the set is highly unbalanced: ~5% positives and ~95% negatives.

I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.

What I've done thus far is to build a balanced training set by selecting all entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to modify the model so that it works on the original unbalanced population/set.
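A minimal sketch of the balancing step described above, on toy data standing in for my actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for my data: 1000 rows, 5% positives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1  # 50 positives, 950 negatives

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# draw as many negatives as there are positives
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)

balanced_idx = np.concatenate([pos_idx, neg_sample])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]  # 50/50 split
```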

What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.

agentscully

2 Answers

Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
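For instance, LogisticRegression supports it (in recent sklearn versions the keyword value is "balanced"; "auto" is the older spelling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# toy unbalanced data: ~5% positives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# class_weight="balanced" re-weights each sample inversely to its
# class frequency inside the loss, so no resampling is needed
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)

# the fitted model applies directly to the original unbalanced data
probs = clf.predict_proba(X)[:, 1]
```

Because the re-weighting happens inside the objective, there is no separate "adjust the model back" step afterwards.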

Also, you can rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise).
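A sketch of the over-sampling variant: duplicate the positive rows with replacement until the classes are even, jittering the copies with a little Gaussian noise (the noise scale 0.05 here is arbitrary and should be tuned to your feature scales):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy unbalanced data: 50 positives, 950 negatives
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

pos = X[y == 1]
n_extra = int((y == 0).sum() - (y == 1).sum())  # how many copies we need

# duplicate positives with replacement and jitter them slightly
dup = pos[rng.integers(len(pos), size=n_extra)]
dup = dup + rng.normal(scale=0.05, size=dup.shape)

X_over = np.vstack([X, dup])
y_over = np.concatenate([y, np.ones(n_extra, dtype=int)])  # now 50/50
```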

ogrisel
  • Yes, class_weight='auto' works great. Is there any advantage to not using the built-in/black-box auto weighting and instead rebalancing the training set (as I originally did)? Regardless, if I took the approach of balancing the training set, how do I adjust the fitted model to apply to an unbalanced test set? – agentscully Feb 23 '13 at 05:17
  • It's not that black-box: it just re-weights the samples in the empirical objective function being optimized by the algorithm. Under-sampling over-represented classes is good because training is faster :) but you are dropping data, which is bad, especially if your model is already in an overfitting regime (significant gap between train and test scores). Over-sampling is generally mathematically equivalent to re-weighting but slower because of duplicated operations. – ogrisel Feb 23 '13 at 14:42
@agentscully Have you read the following paper: [SMOTE](https://www.jair.org/media/953/live-953-2037-jair.pdf)? I found it very informative. Here is the link to the Repo. Depending on how you go about balancing your target classes, you can use either:

  • 'auto' (deprecated since version 0.17) or 'balanced', or specify the class ratio yourself, e.g. {0: 0.1, 1: 0.9}.
  • 'balanced': this mode adjusts the weights inversely proportional to class frequencies, as n_samples / (n_classes * np.bincount(y)).
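To see what the 'balanced' mode computes, the formula above can be evaluated by hand on a toy label vector matching the 95/5 split from the question:

```python
import numpy as np

y = np.array([0] * 95 + [1] * 5)   # 95% negatives, 5% positives
n_classes = 2

# n_samples / (n_classes * np.bincount(y))
weights = len(y) / (n_classes * np.bincount(y))
# majority class gets a small weight (~0.526), minority a large one (10.0)
```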

Let me know if more insight is needed.

Pramit