
I have a huge dataset as input for a multiple lasso fit. The predictor matrix has shape 1250 by 1,000,000 and the target matrix has shape 1250 by 1250.

If I fit an ordinary linear regression with sklearn, there is an option to use multiple threads, in which case the whole process runs in a short time with an acceptable result.

sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

In the line above, if I set n_jobs=-1 it will use all available cores, so the computational cost drops dramatically.
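
For illustration, here is a minimal sketch with smaller, made-up shapes (the real 1250 by 1,000,000 predictor matrix would need far more memory):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1250, 5000))   # stand-in for the 1250 x 1,000,000 predictors
    Y = rng.standard_normal((1250, 1250))   # multi-column target

    # n_jobs=-1 lets the multi-target fit use all available cores
    model = LinearRegression(n_jobs=-1).fit(X, Y)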

But there is no such option for lasso regression in sklearn:

sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, normalize='deprecated', precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')

Obviously, it is really computationally expensive to run this fit on a single core. There are options in scikit-learn to run cross-validation for lasso on different CPUs, but my problem is that I'm not doing hyper-parameter optimization. The single problem itself is computationally expensive.

Questions:

  1. Is there any way to do a distributed multiple lasso regression? (Not for hyper-parameter optimization.)
  2. If there isn't any way to parallelize lasso regression, what is the root of this limitation? What is the difference between minimizing the loss function for linear regression and for lasso regression?
mjoudy

1 Answer


As stated in the documentation for n_jobs:

n_jobs int, default=None

The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. 

You need to have more than one target; your dependent variable needs to have two or more columns.

The parallelization works by fitting a model on each y-column separately, as you can see from the source code:

    if self.positive:
        if y.ndim < 2:
            self.coef_ = optimize.nnls(X, y)[0]
        else:
            # scipy.optimize.nnls cannot handle y with shape (M, K)
            outs = Parallel(n_jobs=n_jobs_)(
                delayed(optimize.nnls)(X, y[:, j]) for j in range(y.shape[1])
            )
            self.coef_ = np.vstack([out[0] for out in outs])
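
The same per-column pattern can be reproduced by hand for lasso with joblib. A sketch, assuming X and Y are the predictor and target arrays from the question:

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.linear_model import Lasso

    def fit_one_column(X, y_col):
        # each worker fits an independent single-target lasso
        return Lasso(alpha=1.0).fit(X, y_col).coef_

    # the columns of Y are independent problems, so they can be fitted in parallel
    coefs = Parallel(n_jobs=-1)(
        delayed(fit_one_column)(X, Y[:, j]) for j in range(Y.shape[1])
    )
    coef_matrix = np.vstack(coefs)  # shape (n_targets, n_features)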

I am not sure whether you have more than one target variable. If that is indeed the case, you can consider using MultiOutputRegressor.
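
A minimal sketch of that approach (X and Y as above): MultiOutputRegressor fits one estimator per column of Y, and its n_jobs argument parallelizes those independent fits.

    from sklearn.linear_model import Lasso
    from sklearn.multioutput import MultiOutputRegressor

    # one Lasso is fitted per column of Y; n_jobs=-1 spreads the
    # per-column fits across all available cores
    parallel_lasso = MultiOutputRegressor(Lasso(alpha=1.0), n_jobs=-1)
    parallel_lasso.fit(X, Y)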

I don't think there's a way to parallelize fitting a lasso or linear model when there's only one target variable.

StupidWolf
  • My target variable, as I said, has several columns. You explained `n_jobs`, but the problem is that there is no `n_jobs` argument in the Lasso function. – mjoudy Mar 02 '23 at 21:36