I'm using the scikit-learn LinearRegression class to fit some data. I have a mix of numeric, boolean, and nominal features; each nominal feature is one-hot encoded into one boolean feature per class. When I fit on 4474 samples or fewer, everything seems fine (I haven't evaluated the fit on withheld data yet). But when I fit on 4475 or more samples, the coefficients explode.
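Here's a minimal sketch of roughly what I'm doing (the feature names and data below are made up for illustration; the real pipeline one-hot encodes the nominal columns the same way, with one column per class):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100

# One-hot encode a 3-class nominal feature, keeping ALL classes
# (this is what I do for A, D, E, and F).
a = rng.integers(0, 3, size=n)
A = np.eye(3)[a]                 # columns A=0, A=1, A=2

B = rng.normal(size=(n, 1))      # a numeric feature
X = np.hstack([A, B])
y = rng.normal(size=n)

model = LinearRegression()       # fit_intercept=True by default
model.fit(X, y)
print(model.coef_, model.intercept_)
```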
Here are the coefficients with 4474 samples (I've had to rename the features, so I apologize if that makes things harder to follow):
-53.3027 A=0
-50.6795 A=1
-42.1567 A=2
-49.4219 A=3
-66.0913 A=4
-52.0004 A=5
-43.0018 A=6
356.6542 A=7
-0.2452 B
27.1991 C
6.4098 D=0
-10.8283 D=1
4.4185 D=2
-5.4939 E=0
5.4939 E=1
7.5636 F=0
23.2613 F=1
15.6801 F=2
16.6490 F=3
20.1203 F=4
15.6462 F=5
-98.9207 F=6
74.4071 [intercept]
And here are the coefficients with 4475 samples:
-8851548433742.3105 A=0
-8851548433739.5312 A=1
-8851548433731.1660 A=2
-8851548433738.4355 A=3
-8851548433755.1465 A=4
-8851548433740.6699 A=5
-8851548433731.6973 A=6
-8851548433330.8164 A=7
-0.2412 B
27.2095 C
7046334744114.7773 D=0
7046334744097.5303 D=1
7046334744112.7656 D=2
5440635352.3035 E=0
5440635363.2956 E=1
-796471905928.9204 F=0
-796471905913.2181 F=1
-796471905920.8073 F=2
-796471905919.8351 F=3
-796471905916.3661 F=4
-796471905920.8374 F=5
-796471906035.3826 F=6
2596244960233.4243 [intercept]
Interestingly, the model does seem to learn which dummy columns belong to the same nominal feature, since it assigns roughly the same value to every class of a given feature (e.g., all the A=* coefficients are nearly identical). So the mutual exclusivity of these one-hot features seems to be part of the problem.
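To illustrate that suspicion (a toy sketch with made-up data, not my actual matrix): because every class of a nominal feature gets its own column, the dummy columns of each group sum to the all-ones column, which collides with the intercept column the model adds:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# One-hot encode a 3-class nominal feature, keeping ALL classes.
a = rng.integers(0, 3, size=n)
A = np.eye(3)[a]

# LinearRegression(fit_intercept=True) effectively adds a constant
# column, so the design matrix it solves against looks like this:
X = np.hstack([A, np.ones((n, 1))])

# The dummy columns sum to the intercept column, so X is rank-deficient.
print(np.linalg.matrix_rank(X))  # 3, not 4
print(np.linalg.cond(X))         # enormous (near-singular)
```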
There's nothing special about the 4475th sample; it's actually identical to the 4474th. I've tried skipping that sample and get the same effect. Basically, I can't scale my data up to 5k samples or more (and I have 100k samples, so I really do need to scale further).
I've even done some filtering (removing samples with missing data instead of substituting a default value), and I see the same effect: after filtering, the explosion happens between the 4342nd and 4343rd samples, which are the 4474th and 4475th samples before filtering. So some underlying quirk in the data is clearly part of it, but the explosion itself can't be intentional behavior.
In case you're wondering, the coefficients from the 4474-sample fit above do roughly make sense for the dataset I'm using.