
It is known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined.
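
To illustrate the claim, here is a minimal sketch (a toy example, not the data from the session below): when p > n the Gram matrix X.T X is rank-deficient, so the normal equations have no unique solution.

import numpy as np

rng = np.random.RandomState(0)
n, p = 5, 10                        # fewer samples than variables
X = rng.randn(n, p)

gram = X.T.dot(X)                   # p x p, but its rank is at most n < p
print(gram.shape)                   # (10, 10)
print(np.linalg.matrix_rank(gram))  # 5, so X.T X is singular and OLS has no unique solution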

In sklearn I receive these values:

In [30]: lm = LinearRegression().fit(xx,y_train)

In [31]: lm.coef_
Out[31]: 
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
         0.08619906, -0.08108713]])

In [32]: xx.shape
Out[32]: (1097, 3419)

Call [30] should return an error. How does sklearn work when p > n, as in this case?

EDIT: Looking at scipy's lstsq, it seems the right-hand side is zero-padded so it can hold the larger solution vector:

if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m, :] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
Donbeo

1 Answer


When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum L2-norm solution, i.e.

argmin_w l2_norm(w) subject to Xw = y

This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.

w = np.linalg.pinv(X).dot(y)
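
As a quick sanity check (this snippet is mine, not part of the original answer), you can verify that the pseudoinverse solution solves the system exactly and that moving along the null space of X keeps Xw = y while only increasing the norm:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 10)              # underdetermined: 5 equations, 10 unknowns
y = rng.randn(5)

w = np.linalg.pinv(X).dot(y)
print(np.allclose(X.dot(w), y))   # True: the system is solved exactly

# Any null-space direction of X gives another exact solution with a larger norm.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                 # X.dot(null_dir) is (numerically) zero
w_other = w + 0.5 * null_dir
print(np.allclose(X.dot(w_other), y))               # True: still a solution
print(np.linalg.norm(w) < np.linalg.norm(w_other))  # True: w has the smaller norm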

The specific implementation of scipy.linalg.lstsq used by LinearRegression calls get_lapack_funcs(('gelss',), ...; gelss is precisely an SVD-based solver (provided by LAPACK) that finds the minimum-norm solution.
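
If you want to see this route explicitly, here is a hedged sketch (the variable names are mine): build the minimum-norm solution directly from the SVD and compare it with scipy.linalg.lstsq. The lapack_driver='gelss' keyword only exists in sufficiently recent scipy versions (0.17+, as far as I know), so treat that argument as optional.

import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
X = rng.randn(5, 10)
y = rng.randn(5)

# Minimum-norm solution assembled from the (thin) SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T.dot(U.T.dot(y) / s)

# scipy's lstsq returns the same minimum-norm solution for underdetermined systems.
w_lstsq, residues, rank, sv = lstsq(X, y, lapack_driver='gelss')

print(np.allclose(w_svd, w_lstsq))   # True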

Check out this example

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)   # 5 samples, 10 features: an underdetermined system
y = rng.randn(5)

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_           # sklearn's coefficients
coef2 = np.linalg.pinv(X).dot(y)     # explicit minimum-norm solution

print(coef1)
print(coef2)

And you will see that coef1 and coef2 agree (up to floating-point precision). (Note that fit_intercept=False is specified in the constructor of the sklearn estimator; otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients.)
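
If you do keep the intercept, here is a hedged sketch of what happens (centering is the documented behaviour of LinearRegression; the explicit reconstruction below is my own and assumes a recent sklearn): the features and target are centered, the coefficients are fitted on the centered data, and the intercept is recovered from the means.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

lr = LinearRegression(fit_intercept=True).fit(X, y)

# Reproduce the fit by hand: center, take the minimum-norm solution, add the intercept back.
X_mean, y_mean = X.mean(axis=0), y.mean()
coef = np.linalg.pinv(X - X_mean).dot(y - y_mean)
intercept = y_mean - X_mean.dot(coef)

print(np.allclose(coef, lr.coef_))            # True
print(np.allclose(intercept, lr.intercept_))  # True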

eickenberg
  • would this be the same solution that is obtained with ridge regression? Is this method implemented only in the newest version of sklearn? I am quite sure that in some versions it returned an error – Donbeo May 18 '14 at 21:16
  • No, ridge regression adds an extra L2 *penalization*, thus changing the objective function and the resulting solution. The minimum L2 norm solution is just a (wise) choice among all the solutions in the affine space of solutions when the system is underdetermined. However, if you take ridge regression solutions as a function of the regularization parameter and let this parameter tend to `0`, you will find exactly this minimum norm solution (a small sketch of this limit follows these comments). – eickenberg May 18 '14 at 21:20
  • ok, then the obtained solution should be more or less the same as the final result of the LARS algorithm path, and this makes sense because it is the standard definition of the LSE when p>n. (Is that right?) Do you know if this was also implemented in previous versions of sklearn/scipy? – Donbeo May 18 '14 at 21:23
  • `LARS-Lasso` will converge to the minimum `L1` norm solution among all solutions when the penalty term goes to `0`. As for *pure* `LARS`, it is possible that you are right, since it does not discard variables, but I cannot give you hard evidence for this. – eickenberg May 18 '14 at 21:27
  • Note that in practice when you are confronted with `n < p` you are way better off using `sklearn.linear_model.Ridge` with a small penalty instead of the minimum norm solution. – eickenberg May 18 '14 at 21:44
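
As a hedged illustration of the limit mentioned in the comments (the alpha values below are my own choices): ridge solutions approach the minimum-norm least-squares solution as the penalty goes to 0.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(5, 10)
y = rng.randn(5)

w_min_norm = np.linalg.pinv(X).dot(y)

# The distance to the minimum-norm solution shrinks as alpha -> 0.
for alpha in (1.0, 1e-3, 1e-6, 1e-9):
    ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
    print(alpha, np.linalg.norm(ridge.coef_ - w_min_norm))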