Can you use dask dataframes instead of pandas dataframes to run lassoCV from scikit-learn?

Question

I ran LassoCV from scikit-learn using columns of a dask dataframe (loaded from parquet, in case that makes a difference) as the regression data. I did not call compute() beforehand, so as far as I know the data was still a dask dataframe. The function appears to have worked exactly as it does when I pass it a pandas dataframe.

My question is whether we can really pass dask dataframes directly to LassoCV. I ask because I could not find any documentation for this, so I am not sure whether the output of the function is correct or whether this is just a bug.

EDIT: Below is the relevant part of the code. The full version spans two functions with many side calculations, so I am only showing the steps that involve df0 (the dask dataframe given to the main function) up to the output.

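import dask.dataframe as dd
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

df0 = dd.read_parquet(file)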

if df0['date'].compute().is_monotonic:
    ...

alldates = (df0[df0['date'] >= start]['date'].unique()).compute()

...

in_sample = df0[df0['date'] < date]


run_lasso(df=in_sample, ...)

def run_lasso(df,...):
    model = linear_model.LassoCV(cv=5,
                                 max_iter=1000000,
                                 precompute=True,
                                 n_jobs=1)
    df = df.dropna(subset=[y] + xUniverse)
    scaler = StandardScaler()
    X_lasso = scaler.fit_transform(df[xUniverse])
    y_lasso = df[y]

    model.fit(X_lasso, y_lasso)
    signif_coefs = list(df[xUniverse].columns[model.coef_ != 0])

    return signif_coefs

No, you can't. scikit-learn assumes the data is in memory and doesn't work out of the box with dask.dataframe, but that is the purpose of [dask-ML](https://ml.dask.org/cross_validation.html). You'll need to re-implement your model and either relax some of your criteria or be prepared for longer runtimes, but concessions are always required when working with larger-than-memory data. – Michael Delgado, Aug 06 '22 at 21:12
Do you have any idea how the result I got was calculated, then, given that I actually passed a dask dataframe as the argument to LassoCV? To be clear, the output was exactly what I would expect if the estimation had succeeded. – Thiago, Aug 07 '22 at 01:42
It would help to see the actual code you ran; there might have been a compute call somewhere that turned the dask dataframe into a pandas dataframe. – SultanOrazbayev, Aug 07 '22 at 01:48
If it did work, I think it's pretty likely that something just caused the dataframe to be computed. So if you're wondering whether the results are correct, I think it's likely that they are. But if your definition of "can" is whether you can use scikit-learn on actually larger-than-memory data without computing it, I think the answer is no. – Michael Delgado, Aug 07 '22 at 02:11
@SultanOrazbayev, I also thought that this could be happening. Maybe I am blind, but I just can't see where. I have now updated the question with some of the code. – Thiago, Aug 07 '22 at 12:55
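
One possible explanation for the hidden compute, offered as a sketch rather than a confirmed diagnosis: in-memory estimators typically coerce their inputs to NumPy arrays, and NumPy will call an object's `__array__` method to do so. If a lazy dataframe implements `__array__` by computing itself (which dask dataframes are assumed to do here), then calls such as `scaler.fit_transform(df[xUniverse])` or `model.fit(X_lasso, y_lasso)` would materialise the data without an explicit compute() anywhere in the user's code. The `LazyFrame` class and `fit_like_in_memory_estimator` function below are hypothetical stand-ins written for illustration; they are not dask's or scikit-learn's actual implementation.

```python
import numpy as np


class LazyFrame:
    """Hypothetical stand-in for a lazy dataframe (not dask's real class)."""

    def __init__(self, rows):
        self._rows = rows
        self.compute_calls = 0  # counts how often the data was materialised

    def compute(self):
        # Materialise the data in memory, as compute() would for a lazy collection.
        self.compute_calls += 1
        return np.array(self._rows, dtype=float)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook whenever it needs a real array from this object.
        data = self.compute()
        return data if dtype is None else data.astype(dtype)


def fit_like_in_memory_estimator(X):
    # Rough imitation of estimator input validation (an assumption, not
    # scikit-learn's code): coerce the input to a NumPy array, then do the maths.
    X = np.asarray(X)
    return X.mean(axis=0)


lazy = LazyFrame([[1.0, 2.0], [3.0, 4.0]])
means = fit_like_in_memory_estimator(lazy)

# The result is a plain NumPy array, and compute() ran even though
# the calling code never invoked it explicitly.
print(type(means).__name__, lazy.compute_calls >= 1)
```

If this is what happens in the code above, the results would be correct, but the whole in-sample dataframe would be loaded into memory during the fit, so the approach would not scale to larger-than-memory data.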
