I am using scikit-learn's LogisticRegression object for regularized binary classification. I've read the documentation on intercept_scaling but I don't understand how to choose this value intelligently.

The datasets look like this:

  • 10-20 features, 300-500 replicates
  • Highly non-Gaussian, in fact most observations are zeros
  • The output classes are not necessarily equally likely. In some cases they are almost 50/50, in other cases they are more like 90/10.
  • Typically C=0.001 gives good cross-validated results.

The documentation warns that the intercept itself is subject to regularization, like every other feature, and that intercept_scaling can be used to counteract this. But how should I choose its value? One simple answer is to explore many combinations of C and intercept_scaling and pick the parameters that give the best performance, but that kind of grid search will take quite a while and I'd like to avoid it if possible.
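For concreteness, the search I'd like to avoid would look something like the sketch below. The grids are illustrative guesses rather than recommendations, the data are a synthetic stand-in shaped like mine, and older scikit-learn releases import GridSearchCV from sklearn.grid_search rather than sklearn.model_selection:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in for my data: ~15 features, ~400 replicates,
    # imbalanced 90/10 classes.
    X, y = make_classification(n_samples=400, n_features=15,
                               weights=[0.9, 0.1], random_state=0)

    # Illustrative grids, not recommendations. Per the docs,
    # intercept_scaling only has an effect with the liblinear
    # solver and fit_intercept=True.
    param_grid = {
        "C": [1e-4, 1e-3, 1e-2, 1e-1],
        "intercept_scaling": [0.1, 1.0, 10.0, 100.0],
    }
    search = GridSearchCV(LogisticRegression(solver="liblinear"),
                          param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)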

Ideally, I would like to use the intercept to control the distribution of the output predictions. That is, I would like to ensure that the probability that the classifier predicts "class 1" on the training set equals the proportion of "class 1" data in the training set. I know that this holds automatically under certain circumstances, but it does not hold for my data; I don't know whether that is due to the regularization or to the non-Gaussian nature of the input data.
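As a concrete diagnostic, the sketch below compares the two proportions and then applies the crude post-hoc fix I have in mind: shifting the fitted intercept by hand until the predicted class-1 rate matches the training proportion. The shift is my own idea, not a documented scikit-learn recipe, and the data are again a synthetic stand-in:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=400, n_features=15,
                               weights=[0.9, 0.1], random_state=0)
    clf = LogisticRegression(C=0.001, solver="liblinear").fit(X, y)

    target = np.mean(y == 1)  # proportion of class 1 in the training set
    print("training proportion:", target)
    print("predicted proportion:", np.mean(clf.predict(X) == 1))

    # Crude post-hoc fix (my own sketch): choose the intercept so that
    # exactly the top `target` fraction of the linear scores is positive.
    # Assumes binary 0/1 labels, i.e. clf.classes_ == [0, 1].
    scores = X.dot(clf.coef_.ravel())
    clf.intercept_ = np.array([-np.percentile(scores, 100 * (1 - target))])
    print("after shifting:", np.mean(clf.predict(X) == 1))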

Thanks for any suggestions!

– cxrodgers

1 Answer


Have you tried setting class_weight="auto"? That effectively oversamples the underrepresented class and undersamples the majority class.
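A minimal sketch, on synthetic stand-in data shaped like the question's 90/10 case (note that later scikit-learn releases renamed "auto" to "balanced"; adjust the string to your version):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for the question's 90/10 case.
    X, y = make_classification(n_samples=400, n_features=15,
                               weights=[0.9, 0.1], random_state=0)

    # class_weight="auto" in old releases; renamed "balanced" later on.
    # Each class is reweighted inversely proportional to its frequency.
    clf = LogisticRegression(C=0.001, class_weight="balanced",
                             solver="liblinear").fit(X, y)
    print(clf.predict(X).sum(), "positive predictions out of", len(y))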

(The current stable docs are a bit confusing, since they seem to have been copy-pasted from SVC and not edited for LR; that has just been changed in the bleeding-edge version.)

– Fred Foo
  • Thank you for the suggestion. I'm not using 'auto' because I can't figure out what it does (the code involves multiple inheritance and is therefore a bit confusing to me). Instead, I'm just setting the class weights equal to the number of replicates in each class, because this seemed to do the right thing (see the sketch after this thread for making such weights explicit). Can you link to the updated documentation? I can't find it by googling "sklearn bleeding-edge" or visiting the "bleeding edge" section of the sklearn website. – cxrodgers Jul 18 '13 at 22:01
  • @cxrodgers: it should be at http://scikit-learn.org/dev, but the patch is so recent that I'm not sure the website has been rebuilt yet :) – Fred Foo Jul 18 '13 at 22:33
  • It doesn't look like it's been updated, but I'll keep checking. This has been useful, so I'll accept this answer soon, assuming no one shows up to explain intercept_scaling to me. I wish the documentation stated explicitly what loss function is being optimized, which would make `C` and `intercept_scaling` explicit. I assume it's the same as in the linked research paper, but the terminology is sufficiently different that I'm not sure exactly what is happening, especially with `class_weight`. – cxrodgers Jul 22 '13 at 23:23
  • @cxrodgers: http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression – Fred Foo Jul 23 '13 at 10:27
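Regarding the comment above about setting the class weights by hand: passing a class_weight dict makes the weighting explicit. The sketch below uses the inverse-frequency rule that "auto"/"balanced" applies internally, n_samples / (n_classes * class_count); it is a reconstruction of the idea, not code from this thread:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=400, n_features=15,
                               weights=[0.9, 0.1], random_state=0)

    # Explicit per-class weights: n_samples / (n_classes * class_count),
    # the same inverse-frequency rule that "balanced" applies internally.
    counts = np.bincount(y)
    weights = {cls: len(y) / (2.0 * n) for cls, n in enumerate(counts)}
    clf = LogisticRegression(C=0.001, class_weight=weights,
                             solver="liblinear").fit(X, y)
    print(weights)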