
I have a fairly large dataset with more than 100 features (coefficients) and thousands of entries, so I would like to use the Lasso approach for model training.

I am currently looking into the scikit-learn documentation for its Lasso implementation.

Although the implementation seems straightforward, I was unable to find an input argument that restricts the maximum number of non-zero coefficients, e.g. to 10.

To be clear: in the MATLAB implementation of lasso, the parameter 'DFMax' allows exactly this.

Is there such an option in any Python implementation?

frek13
    +1 because [statsmodels](http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html#statsmodels-regression-linear-model-ols-fit-regularized) also doesn't appear to have a DFMax parameter. – gerowam Jan 27 '17 at 19:25
  • Hmm. Just a theory remark: a hard constraint on the number of non-zero coefficients transforms this easy problem (which is in complexity class P) into a hard one (NP-hard) which is in general infeasible to solve. Not sure how MATLAB handles this (not much is possible besides branch and bound). I would not be surprised to see a huge performance drop then. You can easily define this problem as a mixed-integer programming problem in cvxpy, for example. – sascha Jan 27 '17 at 22:14

2 Answers


Directly restricting the number of non-zero coefficients is an NP-hard problem, and this is one of the beauties of the LASSO: it replaces that NP-hard problem with a tractable convex surrogate.

I don't know how DFMax is implemented in MATLAB, but my suggestion is to do the following:

  1. Use LassoCV to find the best alpha value.
  2. If the number of nonzero coefficients is smaller than your limit, take this alpha value.
  3. If the number of nonzero coefficients is larger than your limit, fit Lasso over a list of increasing alphas (starting from LassoCV's alpha) and stop once the number of nonzero coefficients is at or below your threshold.
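The three steps above can be sketched as follows (a minimal illustration on synthetic data; the 1.5 growth factor for alpha is an arbitrary choice, tune it for your data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

X, y = make_regression(n_samples=1000, n_features=100, n_informative=15,
                       noise=5.0, random_state=0)
max_nonzero = 10  # the DFMax-style limit

# step 1: cross-validate to find the best alpha
alpha = LassoCV(cv=5, random_state=0).fit(X, y).alpha_
model = Lasso(alpha=alpha).fit(X, y)

# steps 2-3: if too many coefficients survive, grow alpha until the
# non-zero count drops to the limit (larger alpha => sparser model)
while np.count_nonzero(model.coef_) > max_nonzero:
    alpha *= 1.5  # assumed growth factor
    model = Lasso(alpha=alpha).fit(X, y)
```

Since the number of non-zero coefficients shrinks toward zero as alpha grows, the loop is guaranteed to terminate.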
DiveIntoML

I don't think the accepted answer is best. Here is an example of tuning the regularization strength to obtain a given number of non-zero L1 coefficients.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from scipy.optimize import differential_evolution

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=10)

target = 10  # desired number of non-zero coefficients

def func(C):
    # squared distance between the achieved and the desired sparsity
    logit = LogisticRegression(penalty='l1', C=C[0], solver='liblinear')
    logit.fit(X, y)
    n_nonzero = np.sum(logit.coef_ != 0)
    return (target - n_nonzero) ** 2

differential_evolution(func, bounds=[(0, 2)], tol=0.1, maxiter=20)
     fun: 0.0
 message: 'Optimization terminated successfully.'
    nfev: 212
     nit: 13
 success: True
       x: array([0.03048243])
logit = LogisticRegression(penalty='l1', C=0.03048243, solver='liblinear')
logit.fit(X, y)
np.sum(logit.coef_ != 0)

We have found the optimal regularization parameter in order to have exactly 10 non-zero coefficients.
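For plain (non-logistic) Lasso regression, a similar effect can be sketched without a black-box optimizer, assuming scikit-learn's lasso_path helper: compute the whole regularization path once, then pick the smallest alpha whose model keeps at most the target number of features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=2000, n_features=50, n_informative=10,
                       noise=1.0, random_state=10)
target = 10  # desired maximum number of non-zero coefficients

# alphas are returned in decreasing order, so the non-zero count
# (roughly) increases along the path; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=200)
n_nonzero = np.count_nonzero(coefs, axis=0)

# smallest alpha (largest index) that still keeps at most `target` features
idx = np.where(n_nonzero <= target)[0].max()
best_alpha = alphas[idx]
```

One call to lasso_path is usually much cheaper than repeatedly refitting inside an optimizer, because the solver warm-starts along the path.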

Jonathan