
While studying Python's scikit-learn, the first example I come across is Generalized Linear Models.

The code of its very first example:

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
array([ 0.5,  0.5])

Here I assume that [[0, 0], [1, 1], [2, 2]] represents the data matrix, i.e. x1 = c(0,1,2) and x2 = c(0,1,2), with y = c(0,1,2) as the target.

Immediately, I take array([ 0.5, 0.5]) to be the coefficients for x1 and x2.

But are there standard errors for those estimates? What about t-tests, p-values, R² and other statistics?

Then I try to do the same thing in R.

X = data.frame(x1 = c(0,1,2), x2 = c(0,1,2), y = c(0,1,2))
lm(data=X, y~x1+x2)
Call:
lm(formula = y ~ x1 + x2, data = X)

#Coefficients:
#(Intercept)           x1           x2  
#  1.282e-16    1.000e+00           NA  

Obviously x1 and x2 are perfectly collinear, so OLS fails (R reports NA for x2). Why does scikit-learn still work and give this result? Am I understanding sklearn in the wrong way? Thanks.

John
    perhaps http://stats.stackexchange.com/questions/116825/different-output-for-r-lm-and-python-statsmodel-ols-for-linear-regression – hrbrmstr Oct 11 '16 at 01:27

1 Answer


Both solutions are correct (assuming that NA behaves like a zero). Which solution is favored depends on the numerical solver used by the OLS estimator.
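A quick check that both parameter vectors are valid least squares solutions here: R's (1, 0) (treating the NA as 0) and sklearn's (0.5, 0.5) both fit the training data exactly.

```python
# Both coefficient vectors reproduce y exactly on this collinear data.
import numpy as np

X = np.array([[0, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0.0, 1.0, 2.0])

for w in (np.array([1.0, 0.0]), np.array([0.5, 0.5])):
    print(X @ w)  # both print [0. 1. 2.]
```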

sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq, which in turn calls the LAPACK gelsd routine, described here:

http://www.netlib.org/lapack/lug/node27.html

In particular, it says that when the problem is rank deficient, it seeks the minimum-norm least squares solution.
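You can see this behavior directly with numpy's lstsq, which uses the same gelsd routine:

```python
# Sketch: lstsq detects the rank deficiency and returns the
# minimum-norm solution among all exact least squares fits.
import numpy as np

X = np.array([[0, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0.0, 1.0, 2.0])

w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 1: the two identical columns make X rank deficient
print(w)     # [0.5 0.5], the smallest-norm w among all exact fits
```

Among all (w1, w2) with w1 + w2 = 1 that fit the data exactly, (0.5, 0.5) has the smallest Euclidean norm, which is why sklearn reports it.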

If you want to favor the other solution, you can use a coordinate descent solver with a tiny bit of L1 penalty, as implemented in the Lasso class:

>>> from sklearn.linear_model import Lasso
>>> reg = Lasso(alpha=1e-8)
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

Lasso(alpha=1e-08, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.coef_
array([  9.99999985e-01,   3.97204719e-17])
ogrisel