I'm new to Dask and am seeing painfully slow performance when rewriting native sklearn code to use Dask. I've simplified the use case as much as possible in the hope of getting some help.
Using standard sklearn/numpy/pandas, I have the following:
import pandas as pd
from sklearn import linear_model
df = pd.read_csv(location, index_col=False)  # A ~75MB CSV
# Build feature list and dependent variables, code irrelevant
# Tol is the convergence tolerance, defined elsewhere
model = linear_model.Lasso(alpha=0.1, normalize=False, max_iter=100, tol=Tol)
model.fit(features.values, dependent)
print(model.coef_)
print(model.intercept_)
This takes a few seconds to compute. I then have the following in Dask:
import joblib
from dask_glm.estimators import LinearRegression
# Read in CSV and prepare params like before but using dask arrays/dataframes instead
with joblib.parallel_backend('dask'):
    # Coerce data
    X = self.features.to_dask_array(lengths=True)
    y = self.dependents
    # Build regression
    lr = LinearRegression(fit_intercept=True, solver='admm', tol=self.tolerance, regularizer='l1', max_iter=100, lamduh=0.1)
    lr.fit(X, y)
    print(lr.coef_)
    print(lr.intercept_)
This takes about 30 minutes to compute. I only have one Dask worker in my development cluster, but it has 16 GB of RAM and unbounded CPU.
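In case it helps anyone reproduce the comparison, here is a minimal sketch of how each fit could be wrapped in a wall-clock timer. The `timed` helper and the `timings` dict are hypothetical names for illustration, not part of my actual code, and the loop inside the `with` block is only a stand-in for the real `fit` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Keep the measurement so the two runs can be compared afterwards
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


timings = {}
with timed("placeholder workload", timings):
    # Stand-in for model.fit(...) or lr.fit(X, y)
    sum(i * i for i in range(100_000))
```

Wrapping only the `fit` call, rather than the CSV loading as well, would show whether the time is going into the solver or into data preparation.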
Does anyone have any idea why this is so slow?
Hopefully my code omissions aren't significant!
NB: Before anyone asks why I'm using Dask at all, this is deliberately the simplest possible use case. It was a proof-of-concept exercise to check that things would function as expected.