I'm attempting to perform a lasso regression on a dataset that is larger than main memory, using Dask, but there doesn't seem to be a cleanly documented way to do so.

I previously found a somewhat related question, but no actual answer.

Looking into how scikit-learn sets up its Lasso regression, I thought I could set it up the same way. For example, here is one approach I tried:

from dask_ml.datasets import make_regression
import dask_glm.families
import dask_glm.regularizers
import dask_glm.algorithms
from sklearn import linear_model  # for the scikit-learn comparison below

# dask arrays
X, y = make_regression(n_samples=1000, chunks=100)

# in-memory numpy arrays for the scikit-learn fit
df_X = X.compute()
df_y = y.compute()

family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(
    X=X, y=y, max_iter=100000, family=family, regularizer=regularizer,
    alpha=0.01, normalize=False, fit_intercept=False,
)
print(b)

reg = linear_model.Lasso(alpha=0.01, fit_intercept=False)
reg.fit(df_X, df_y)
print(reg.coef_)

However, the coefficients don't match up at all, and the Dask version's coefficients seem much less stable than scikit-learn's.
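
Since the L1 penalty is non-smooth, I also wondered whether plain gradient descent is even the right solver here, and whether dask_glm's proximal gradient algorithm is what I should be using instead. Here's a sketch of what I mean (I'm assuming lamduh is dask_glm's name for the penalty strength; I haven't found this documented clearly):

from dask_ml.datasets import make_regression
import dask_glm.algorithms
import dask_glm.families
import dask_glm.regularizers

# dask arrays
X, y = make_regression(n_samples=1000, chunks=100)

# proximal gradient descent handles the non-smooth L1 term explicitly
b = dask_glm.algorithms.proximal_grad(
    X=X, y=y,
    family=dask_glm.families.Normal(),
    regularizer=dask_glm.regularizers.L1(),
    lamduh=0.01,
    max_iter=10000,
)
print(b)

Even then, I don't know how lamduh maps onto scikit-learn's alpha; scikit-learn scales the squared-error term by 1/(2 * n_samples), which by itself could explain coefficients that don't match.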

Here's another approach I tried, this time based on a comment on this GitHub issue:

from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression

X, y = make_regression(n_samples=1000, chunks=100)

# dask_glm's estimator interface, with an L1 (lasso) penalty
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
print(lr.coef_)

Again, the coefficients seem very unstable.

Ideally there would already be an implementation of Lasso on Dask for this, but I can't seem to find much on the internet except for running LassoCV with Dask as the joblib backend, which is a little different from what I want.
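
For reference, that joblib-backed pattern looks something like the following (as I understand it, this only parallelizes the cross-validation search across a Dask cluster; each fit still needs the data in memory, which is why it doesn't solve my problem):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

client = Client()  # connect to a Dask cluster (local by default)

# plain in-memory numpy arrays -- this is the limitation
X, y = make_regression(n_samples=1000)

lasso = LassoCV(fit_intercept=False, n_jobs=-1)
with joblib.parallel_backend('dask'):
    lasso.fit(X, y)
print(lasso.coef_)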
