
I'm using the scikit-learn `LinearRegression` class to fit some data. I have a mix of numeric, boolean, and nominal features; each nominal feature is split into one boolean feature per class. When I fit on 4474 samples or fewer, everything seems fine (I haven't evaluated the exact fit on withheld data yet). But when I fit on 4475 or more samples, the coefficients explode.

Here are the coefficients with 4474 samples (I've had to alter the feature names, so I apologize if that makes this hard to follow):

-53.3027  A=0
-50.6795  A=1
-42.1567  A=2
-49.4219  A=3
-66.0913  A=4
-52.0004  A=5
-43.0018  A=6
356.6542  A=7
 -0.2452  B
 27.1991  C
  6.4098  D=0
-10.8283  D=1
  4.4185  D=2
 -5.4939  E=0
  5.4939  E=1
  7.5636  F=0
 23.2613  F=1
 15.6801  F=2
 16.6490  F=3
 20.1203  F=4
 15.6462  F=5
-98.9207  F=6
 74.4071  [intercept]

And here are the coefficients with 4475 samples:

-8851548433742.3105  A=0
-8851548433739.5312  A=1
-8851548433731.1660  A=2
-8851548433738.4355  A=3
-8851548433755.1465  A=4
-8851548433740.6699  A=5
-8851548433731.6973  A=6
-8851548433330.8164  A=7
            -0.2412  B
            27.2095  C
 7046334744114.7773  D=0
 7046334744097.5303  D=1
 7046334744112.7656  D=2
    5440635352.3035  E=0
    5440635363.2956  E=1
 -796471905928.9204  F=0
 -796471905913.2181  F=1
 -796471905920.8073  F=2
 -796471905919.8351  F=3
 -796471905916.3661  F=4
 -796471905920.8374  F=5
 -796471906035.3826  F=6
 2596244960233.4243  [intercept]

Interestingly, it does seem to learn what the nominal classes are, since it assigns roughly similar values to all the classes of a given higher-level feature (e.g., all the A=* coefficients are roughly equal). So the mutual exclusivity of these features seems to be part of the problem.
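For what it's worth, this pattern is consistent with the "dummy-variable trap": when every class of a nominal feature gets its own indicator column, those columns sum to 1 and duplicate the intercept, so the design matrix is rank-deficient and the least-squares solution is not unique. A small synthetic sketch (hypothetical data, not the actual dataset) shows the rank deficiency:

```python
# Hypothetical illustration of the dummy-variable trap: one-hot columns of a
# single nominal feature sum to 1, duplicating the intercept column and
# making the design matrix rank-deficient. The least-squares solution is then
# not unique, and tiny changes (e.g. one extra sample) can swing it wildly.
import numpy as np

rng = np.random.default_rng(0)
n = 100
classes = rng.integers(0, 3, size=n)            # a 3-class nominal feature
onehot = np.eye(3)[classes]                     # one column per class
X_full = np.column_stack([np.ones(n), onehot])  # intercept + all 3 classes

# The intercept column equals the sum of the indicator columns,
# so X_full has rank 3 even though it has 4 columns:
print(np.linalg.matrix_rank(X_full))            # 3

# Dropping one class (a "reference level") restores full column rank:
X_dropped = np.column_stack([np.ones(n), onehot[:, 1:]])
print(np.linalg.matrix_rank(X_dropped))         # 3 == number of columns
```

Dropping one indicator column per nominal feature (or adding a regularization penalty) is the usual way to make the solution unique again.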

There's nothing special about the 4475th sample; it's actually identical to the 4474th. I've tried skipping that sample and get the same effect. Basically, I can't scale my data up to 5k samples or more (and I have 100k samples, so I really do need to scale it up further).

I've even done some filtering (i.e., removing samples with missing data instead of substituting a default value), and it has the same effect (in that case the explosion happens between the 4342nd and 4343rd samples after filtering, which are the 4474th and 4475th samples before filtering). So it's clearly due in part to some underlying quirk in the data, but that doesn't mean this is intentional behavior; it clearly can't be.

In case you're wondering, the coefficients for the 4474 samples above make rough sense for the dataset I'm using.

Kirk
    What is your question? – Bob Dalgleish Jul 14 '17 at 18:37
  • So, basically, is there anything I can do to make this not happen? One thing I did try is log-scaling the target data, which didn't help. – Kirk Jul 14 '17 at 19:54
  • No normalization/standardization (no code shown)? That can help. No regularization (no code shown)? That can also help. Not working at ~5k samples does not necessarily mean it won't work at 100k, I think (depending on the algorithm, of course). My best advice (besides the whole normalization/regularization business): use the [SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) (with the params set to be equivalent to your task here; e.g. penalty='none', maybe). – sascha Jul 14 '17 at 20:53
  • @sascha `SGDRegressor` might well have worked; I'll need to investigate further to be sure. I still get crazy large values, but when normalized they generally look logical. I didn't do any normalization or regularization before, beyond binarizing the nominals, so the code was basically just `lm = LinearRegression()` and `lm.fit(features, targets)`. I'm guessing that `LinearRegression` fits analytically, while `SGDRegressor` uses an optimization approach (gradient descent). So perhaps that's just the safer bet in general with relatively large datasets? – Kirk Jul 14 '17 at 22:20
  • Well... the original regressor is based on [this](http://www.netlib.org/lapack/explore-3.1.1-html/dgelsd.f.html) and should be numerically more robust (even if it may fail here). `SGDRegressor` might fail too under some circumstances. It surely is the suitable approach for huge data; maybe 100k samples already count as huge (while 5k do not). Normalization and standardization are always worth considering. – sascha Jul 14 '17 at 22:43
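To illustrate the regularization route the comments suggest: a minimal, hypothetical sketch (synthetic data standing in for the `features`/`targets` names in Kirk's snippet) using `Ridge`, whose small L2 penalty keeps the coefficients bounded even when the one-hot columns are perfectly collinear:

```python
# Minimal sketch (assumed/synthetic data): Ridge adds an L2 penalty, which
# makes the normal equations well-conditioned even with a deliberately
# collinear one-hot block, so the coefficients stay modest instead of
# exploding to ~1e13 as with plain LinearRegression on degenerate data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
classes = rng.integers(0, 3, size=50)
features = np.eye(3)[classes]          # all 3 indicator columns: sums to 1 per row
targets = classes.astype(float) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=1.0)               # alpha controls the penalty strength
model.fit(features, targets)
print(model.coef_, model.intercept_)   # coefficients remain small and stable
```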

0 Answers