I need to train a LASSO model using sklearn. I am given a pair of specifically designed training and validation datasets.

The goal is to let the algorithm generate a sequence of alphas (the L1 penalty strength) automatically, fit a model on the training data for each alpha, and evaluate each model on the validation data. The model that performs best on the validation data is the one to keep.

What is the most efficient way to achieve this?

I tried sklearn.linear_model.LassoCV() by stacking the training and validation data and forcing it to do a single-split CV through the cv argument. However, fit() then refits the final model on the entire merged data using the selected alpha. I could of course take the selected alpha and call sklearn.linear_model.Lasso() again on the training data only, but this seems too cumbersome:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
X, y = make_regression(noise = 4, random_state = 0)
Nrow, Ncol = X.shape
Ntrain = int(np.round(Nrow * 0.7))
Nvalid = Nrow - Ntrain
trainInd = np.arange(Ntrain)
validInd = np.arange(Ntrain, Nrow)
# A list of (train, validation) index pairs; unlike iter(...), a list
# is not exhausted after the first fit() and can be reused.
cvIter = [(trainInd, validInd)]


reg = LassoCV(cv = cvIter, verbose = True).fit(X, y)
# But .fit() will use the optimized alpha and the entire merged data to
# train the final model.

I also tried sklearn.linear_model.lasso_path(), but how do I apply its result to a new dataset (the validation set) to make predictions? It also does not return the intercept term. How can I recover it?
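One possible way to use lasso_path() for this is sketched below, under the assumption that centering X and y with the training means reproduces what a model with an intercept would do. The path is computed on the centered training data, the intercept for each alpha is recovered as y_mean - X_mean @ coef, and every alpha is scored on the validation set in one matrix product. Variable names such as valid_mse and intercepts are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(noise=4, random_state=0)
n_train = int(round(len(X) * 0.7))
X_train, y_train = X[:n_train], y[:n_train]
X_valid, y_valid = X[n_train:], y[n_train:]

# lasso_path fits no intercept, so center with training statistics only.
X_mean, y_mean = X_train.mean(axis=0), y_train.mean()
alphas, coefs, _ = lasso_path(X_train - X_mean, y_train - y_mean)

# coefs has shape (n_features, n_alphas); recover one intercept per alpha.
intercepts = y_mean - X_mean @ coefs

# Predict on the validation set for all alphas: shape (n_valid, n_alphas).
valid_pred = X_valid @ coefs + intercepts
valid_mse = ((valid_pred - y_valid[:, None]) ** 2).mean(axis=0)

# Keep the coefficients and intercept with the lowest validation error.
best = int(np.argmin(valid_mse))
best_alpha = alphas[best]
best_coef, best_intercept = coefs[:, best], intercepts[best]
print(best_alpha, valid_mse[best])
```

Because the whole path is fit once with warm starts and no final refit takes place, the validation rows never influence the coefficients.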

Thanks!

I came up with a "smart" workaround:

# Weight 1 for the training rows, (almost) 0 for the validation rows.
sampleW = np.concatenate([np.ones(Ntrain), np.full(Nvalid, 1e-200)])
reg = LassoCV(cv = [(trainInd, validInd)], verbose = True).fit(
    X, y, sample_weight = sampleW)

Lowering the weight of the validation rows to almost 0 effectively excludes them from training. My tests confirm that it gives the correct result, but it looks ridiculous. It shouldn't be this hard to achieve what I need.

user2961927

1 Answer

This may be too basic for what you're looking for, but I would focus on the problem you've already identified: finding the optimal alpha value. The first thing that comes to mind is a scipy optimizer, something like this:

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

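# training_X, training_y, validation_X, validation_y are assumed to be
# your existing arrays.
def cost(alpha):
  # Fit on the training set only, then score on the validation set.
  model = Lasso(alpha=alpha)
  model.fit(training_X, training_y)
  return np.linalg.norm(
    model.predict(validation_X) - validation_y)

# Lasso needs a non-negative alpha, so search a bounded interval
# (these bounds are only illustrative).
res = minimize_scalar(cost, bounds=(1e-4, 10.0), method='bounded')
print('Optimal alpha', res.x, 'yields error', res.fun)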

Since you're looking for the best lasso as a function of the alpha value alone, you only need to minimize a cost function with a scalar input and a scalar output. (See the scipy.optimize.minimize_scalar docs.)
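A self-contained variant of this idea, run on the data from the question, might look like the sketch below. Searching over log10(alpha) rather than alpha, the bounds of 1e-3 to 1e2, and max_iter=10000 are my own assumptions, not part of the approach above. Note also that the validation error need not have a single minimum in alpha, so a scalar optimizer may settle on a local optimum.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(noise=4, random_state=0)
n_train = int(round(len(X) * 0.7))
training_X, training_y = X[:n_train], y[:n_train]
validation_X, validation_y = X[n_train:], y[n_train:]

def cost(log_alpha):
    # Fit on the training rows only; alpha = 10 ** log_alpha stays positive.
    model = Lasso(alpha=10.0 ** log_alpha, max_iter=10000)
    model.fit(training_X, training_y)
    # Euclidean norm of the validation residuals.
    return np.linalg.norm(model.predict(validation_X) - validation_y)

# Bounded search over log10(alpha) in [-3, 2] (illustrative bounds).
res = minimize_scalar(cost, bounds=(-3.0, 2.0), method='bounded')
best_alpha = 10.0 ** res.x
print('Optimal alpha', best_alpha, 'yields error', res.fun)
```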

lmjohns3