While studying Python's scikit-learn (sklearn), the first example I come across is Generalized Linear Models. Here is the code of its very first example:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
# array([0.5, 0.5])
Here I assume [[0, 0], [1, 1], [2, 2]] represents a data.frame containing x1 = c(0, 1, 2) and x2 = c(0, 1, 2), and that y = c(0, 1, 2) as well.
Immediately, I take array([0.5, 0.5]) to be the coefficients for x1 and x2.
But are there standard errors for those estimates? What about t-tests, p-values, R², and other such statistics?
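The only such figure I have found in sklearn itself is R², exposed through the score() method; I could not find standard errors or p-values anywhere. A quick check on the same data (my own snippet, not from the tutorial):

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# score() returns R^2; the fit here is exact, so it prints 1.0
print(reg.score([[0, 0], [1, 1], [2, 2]], [0, 1, 2]))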
Then I try to do the same thing in R.
X = data.frame(x1 = c(0, 1, 2), x2 = c(0, 1, 2), y = c(0, 1, 2))
lm(data = X, y ~ x1 + x2)
# Call:
# lm(formula = y ~ x1 + x2, data = X)
#
# Coefficients:
# (Intercept)           x1           x2
#   1.282e-16    1.000e+00           NA
Obviously x1 and x2 are completely linearly dependent, so OLS fails. Why does sklearn still work, and why does it give this result? Am I understanding sklearn in the wrong way? Thanks.
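For what it's worth, I can reproduce sklearn's answer with plain NumPy, which makes me suspect that LinearRegression returns the minimum-norm least-squares solution (as computed from a pseudoinverse) instead of failing on the rank-deficient design. This is just my own sketch, not anything from the sklearn docs, and the variable names are mine:

import numpy as np

# Same data as above, with an explicit column of ones for the intercept
X = np.array([[1, 0, 0],
              [1, 1, 1],
              [1, 2, 2]], dtype=float)
y = np.array([0, 1, 2], dtype=float)

# np.linalg.lstsq returns the minimum-norm solution when X is rank-deficient
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2 -- the x1 and x2 columns are identical
print(beta)  # ~[0., 0.5, 0.5] -- the single slope of 1 split evenly over x1 and x2

If that is what happens, array([0.5, 0.5]) would just be the smallest-norm way to split one effective slope of 1 between two identical columns, which might explain why sklearn does not complain the way lm does.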