
I am doing a LassoCV with 1000 coefficients. Statsmodels did not seem to be able to handle this many coefficients, so I am using scikit-learn. Statsmodels allowed for .fit_constrained("coef1 + coef2...=1"), which constrained the sum of the coefficients to equal 1. I need to do this in scikit-learn. I am also keeping the intercept at zero.

from sklearn.linear_model import LassoCV

LassoCVmodel = LassoCV(fit_intercept=False)
LassoCVmodel.fit(x,y)

Any help would be appreciated.

TChi

2 Answers


As mentioned in the comments: the docs and the sources do not indicate that this is supported within sklearn!

I just tried the alternative of using off-the-shelf convex-optimization solvers. It's just a simple prototype-like approach, and it might not be a good fit for your (incompletely defined) task (sample size?).

Some comments:

  • implementation/model-formulation is easy
  • the problem is harder to solve than I thought
    • solver ECOS has general trouble
    • solver SCS reaches good accuracy (though worse than sklearn's)
    • but: tuning the iterations to improve accuracy breaks the solver
      • the problem is then reported as infeasible by SCS!
    • SCS + a big-M-based formulation (the constraint is added as a penalization term within the objective) looks usable, but might need tuning
    • only open-source solvers were tested; commercial ones might be much better

Further things to try:

  • For tackling huge problems (where performance becomes more important than robustness and accuracy), an (Accelerated) Projected Stochastic Gradient approach looks promising

Code

""" data """
from time import perf_counter as pc
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
A = diabetes.data
y = diabetes.target
alpha=0.1

print('Problem-size: ', A.shape)

def obj(x):  # following sklearn's definition from user-guide!
    return (1. / (2*A.shape[0])) * np.square(np.linalg.norm(A.dot(x) - y, 2)) + alpha * np.linalg.norm(x, 1)


""" sklearn """
print('\nsklearn classic l1')
from sklearn import linear_model
clf = linear_model.Lasso(alpha=alpha, fit_intercept=False)
t0 = pc()
clf.fit(A, y)
print('used (secs): ', pc() - t0)
print(obj(clf.coef_))
print('sum x: ', np.sum(clf.coef_))

""" cvxpy """
print('\ncvxpy + scs classic l1')
from cvxpy import *
x = Variable(A.shape[1])
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y) + alpha * norm(x, 1))
problem = Problem(objective, [])
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))

""" cvxpy -> sum x == 1 """
print('\ncvxpy + scs sum == 1 / 1st approach')
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y))
constraints = [sum(x) == 1]
problem = Problem(objective, constraints)
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))

""" cvxpy approach 2 -> sum x == 1 """
print('\ncvxpy + scs sum == 1 / 2nd approach')
M = 1e6
objective = Minimize((1. / (2*A.shape[0])) * sum_squares(A*x - y) + M*(sum(x) - 1))
constraints = [sum(x) == 1]
problem = Problem(objective, constraints)
t0 = pc()
problem.solve(solver=SCS, use_indirect=False, max_iters=10000, verbose=False)
print('used (secs): ', pc() - t0)
print(obj(x.value.flat))
print('sum x: ', np.sum(x.value.flat))

Output

Problem-size:  (442, 10)

sklearn classic l1
used (secs):  0.001451024380348898
13201.3508496
sum x:  891.78869298

cvxpy + scs classic l1
used (secs):  0.011165673357417458
13203.6549995
sum x:  872.520510561

cvxpy + scs sum == 1 / 1st approach
used (secs):  0.15350853891775978
13400.1272148
sum x:  -8.43795102327

cvxpy + scs sum == 1 / 2nd approach
used (secs):  0.012579569383536493
13397.2932976
sum x:  1.01207061047

Edit

Just for fun I implemented a slow, non-optimized prototype solver using the accelerated projected gradient approach (remarks in the code!).

This one should scale much better for huge problems (as it's a first-order method), despite its slow behaviour here (because it's not optimized). There should be a lot of potential!

Warning: some people might see this as advanced numerical optimization :-)

Edit 2: I forgot to add the nonnegativity constraint in the projection (sum(x) == 1 on its own does not make much sense if x can be negative!). Adding it makes the solving much harder (numerical trouble), and it is obvious that one of those fast special-purpose projections should be used (I'm too lazy to do that right now; I think O(n log n) algorithms are available; a sketch is given after the output below). Again: this APG solver is a prototype, not ready for real tasks.

Code

""" accelerated pg  -> sum x == 1 """
def solve_pg(A, b, momentum=0.9, maxiter=1000):
    """ remarks:
            algorithm: accelerated projected gradient
            projection: proj on probability-simplex
                -> naive and slow using cvxpy + ecos
            line-search: armijo-rule along projection-arc (Bertsekas book)
                -> suffers from slow projection
            stopping-criterion: naive
            gradient-calculation: precomputes AtA
                -> not needed and not recommended for huge sparse data!
    """

    M, N = A.shape
    x = np.zeros(N)

    AtA = A.T.dot(A)
    Atb = A.T.dot(b)

    stop_count = 0

    # projection helper
    x_ = Variable(N)
    v_ = Parameter(N)
    objective_ =  Minimize(0.5 * square(norm(x_ - v_, 2)))
    constraints_ = [sum(x_) == 1]
    problem_ = Problem(objective_, constraints_)

    def gradient(x):
        return AtA.dot(x) - Atb

    def obj(x):
        return 0.5 * np.linalg.norm(A.dot(x) - b)**2

    it = 0
    while True:
        grad = gradient(x)

        # line search
        alpha = 1
        beta = 0.5
        sigma=1e-2
        old_obj = obj(x)
        while True:
            new_x = x - alpha * grad
            new_obj = obj(new_x)
            if old_obj - new_obj >= sigma * grad.dot(x - new_x):
                break
            else:
                alpha *= beta

        x_old = x[:]
        x = x - alpha*grad

        # projection
        v_.value = x
        problem_.solve()
        x = np.array(x_.value.flat)

        y = x + momentum * (x - x_old)

        if np.abs(old_obj - obj(x)) < 1e-2:
            stop_count += 1
        else:
            stop_count = 0

        if stop_count == 3:
            print('early-stopping @ it: ', it)
            return x

        it += 1

        if it == maxiter:
            return x


print('\n acc pg')
t0 = pc()
x = solve_pg(A, y)
print('used (secs): ', pc() - t0)
print(obj(x))
print('sum x: ', np.sum(x))

Output

acc pg
early-stopping @ it:  367
used (secs):  0.7714511330487027
13396.8642379
sum x:  1.00000000002
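
For reference, the fast special-purpose projection mentioned in Edit 2 can be done with the standard sort-based algorithm, which is O(n log n) due to the sort. A minimal sketch (this helper and its name are mine, not part of the code above):

import numpy as np

def project_to_simplex(v):
    """ Euclidean projection of v onto {x : x >= 0, sum(x) == 1}
        (standard sort-based algorithm; O(n log n) due to the sort) """
    n = v.shape[0]
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    # largest index rho with u[rho] * (rho + 1) > css[rho] - 1 (always exists for rho = 0)
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# example: project_to_simplex(np.array([0.4, 1.2, -0.3])) -> array([0.1, 0.9, 0. ])

Replacing the cvxpy-based projection helper inside solve_pg with such a function would also add the missing nonnegativity constraint.
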
sascha

I am surprised nobody has stated this before in the comments, but I think there is a conceptual misunderstanding in your question statement.

Let us start with the definition of the Lasso estimator, for example as given in Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright:

Given a collection of N predictor-response pairs {(x_i, y_i)}, the lasso finds the fit coefficients (β_0, β_i) of the least-squares optimization problem with the additional constraint that the L1-norm of the vector of coefficients β_i is less than or equal to t.

Here the L1-norm of the coefficient vector is the sum of the magnitudes of all coefficients. In the case where your coefficients are all positive, that L1-norm is exactly the sum of the coefficients, which is precisely the quantity your question wants to constrain.

Now, what is the relationship between this t and the alpha parameter used in scikit-learn? Well, it turns out that by Lagrangian duality, there is a one-to-one correspondence between every value of t and a value for alpha.

This means that when you use LassoCV with a range of values for alpha, you are, by definition, using a range of allowable values for the sum of (the magnitudes of) all your coefficients!

To sum up, the condition that the sum of all your coefficients equals one is equivalent to running the Lasso with a particular value of alpha.
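
A quick way to see this correspondence empirically (a minimal sketch, not part of the original answer, using the same diabetes data as the other answer): as alpha shrinks, the L1-norm of the fitted coefficients, i.e. the used "budget" t, generally grows.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
for alpha in [10.0, 1.0, 0.1, 0.01]:
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=100000).fit(X, y).coef_
    # L1-norm of the fitted coefficients for this alpha
    print(alpha, np.abs(coef).sum())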

Bremsstrahlung