I need to train a LASSO model using sklearn. I am given a pair of specifically designed training and validation datasets.
The goal is to let the algorithm automatically generate a sequence of alphas (the L1 penalty strength) and, for each alpha, fit a model on the training data and evaluate it on the validation data. Finally, I want to select the model that performs best on the validation data.
How can I achieve this in the most efficient way?
I tried sklearn.linear_model.LassoCV() on the merged training and validation data, forcing it to use a single train/validation split by supplying an iterator to the cv argument. However, the fit() method ultimately uses the optimized alpha and the entire merged data to produce the final model. I could of course take the optimized alpha and call sklearn.linear_model.Lasso() again on the training data alone, but this seems too troublesome:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(noise=4, random_state=0)
Nrow, Ncol = X.shape
Ntrain = int(np.round(Nrow * 0.7))
Nvalid = Nrow - Ntrain

# The first 70% of the rows are for training, the rest for validation.
trainInd = np.arange(Ntrain)
validInd = np.arange(Ntrain, Nrow)

# A single (train, validation) pair makes LassoCV evaluate just one split.
trainValidInd = [(trainInd, validInd)]
cvIter = iter(trainValidInd)
reg = LassoCV(cv=cvIter, verbose=True).fit(X, y)

# But .fit() will use the optimized alpha and the entire merged data to
# train the final model.
I also tried sklearn.linear_model.lasso_path(), but how do I apply its output to a new dataset (the validation set) to make predictions? It also doesn't return the intercept term. How can I find it?
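To make the lasso_path() question concrete, here is a rough sketch of what I imagine it would take. It assumes that, because lasso_path() fits no intercept, centering X and y with the training means and then recovering the intercept as y_mean - x_mean @ coef is the right approach. The variable names are mine, and I have not checked that the result matches Lasso() exactly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(noise=4, random_state=0)
n_train = int(np.round(len(X) * 0.7))
X_train, y_train = X[:n_train], y[:n_train]
X_valid, y_valid = X[n_train:], y[n_train:]

# lasso_path fits no intercept, so center with the training means first.
x_mean = X_train.mean(axis=0)
y_mean = y_train.mean()
alphas, coefs, _ = lasso_path(X_train - x_mean, y_train - y_mean)

# coefs has shape (n_features, n_alphas); recover one intercept per alpha.
intercepts = y_mean - x_mean @ coefs

# Validation predictions for every alpha at once: shape (n_valid, n_alphas).
valid_pred = X_valid @ coefs + intercepts
valid_mse = ((valid_pred - y_valid[:, None]) ** 2).mean(axis=0)

# Keep the alpha (and its coefficients) with the lowest validation error.
best = int(np.argmin(valid_mse))
best_alpha = alphas[best]
best_coef = coefs[:, best]
best_intercept = intercepts[best]
```

Here the alpha grid comes from the training rows only, and no model is ever fitted on the validation rows.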
Thanks!
Update: I came up with a "smart" workaround:
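# Weight 1 for the training rows, practically 0 for the validation rows.
sampleW = np.concatenate([np.ones(Ntrain), np.full(Nvalid, 1e-200)])
# Pass the list of splits itself: the iterator cvIter is already exhausted
# after the first fit() call.
reg = LassoCV(cv=trainValidInd, verbose=True).fit(X, y, sample_weight=sampleW)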
Setting the weights of the validation rows to almost 0 effectively excludes them from training. My tests confirm that the result is correct, but it looks ridiculous. It shouldn't be this hard to achieve what I need.