I ran LassoCV from scikit-learn using columns of a dask dataframe (loaded from parquet, in case that makes a difference) as the data in the regression. I did not call compute() beforehand, so as far as I know the data were still a dask dataframe. The function appears to have worked exactly as it does when I pass it a pandas dataframe.

My question is whether we can really pass dask dataframes directly to LassoCV. I ask because I could not find any documentation of that, so I am not sure whether the output of the function is correct or whether this is just a bug.

EDIT: Below is the relevant part of the code. The full version is spread over two functions with many side calculations, so I am only showing the steps that involve df0 (the dask dataframe given to the main function) up to the output.

```python
df0 = dd.read_parquet(file)

if df0['date'].compute().is_monotonic:
    ...

alldates = (df0[df0['date'] >= start]['date'].unique()).compute()

...

in_sample = df0[df0['date'] < date]

run_lasso(df=in_sample, ...)

def run_lasso(df,...):
    model = linear_model.LassoCV(cv=5,
                                 max_iter=1000000,
                                 precompute=True,
                                 n_jobs=1)
    df = df.dropna(subset=[y] + xUniverse)
    scaler = StandardScaler()
    X_lasso = scaler.fit_transform(df[xUniverse])
    y_lasso = df[y]

    model.fit(X_lasso, y_lasso)
    signif_coefs = list(df[xUniverse].columns[model.coef_ != 0])

    return signif_coefs
```
Thiago
  • No, you can't. scikit-learn assumes the data is in memory and doesn't work out of the box with dask.dataframe, but that is the purpose of [dask-ML](https://ml.dask.org/cross_validation.html). You'll need to re-implement and relax some of your criteria or be prepared for longer runtimes, but concessions are always required when working with larger-than-memory data. – Michael Delgado Aug 06 '22 at 21:12
  • Do you have any idea how the result that I got was calculated, then, given that I actually passed a dask dataframe as the argument to LassoCV? To be clear, the output was exactly what I would expect if the estimation was successful. – Thiago Aug 07 '22 at 01:42
  • It would help to see the actual code you ran; there might have been a compute call somewhere that turned the dask df into a pandas df. – SultanOrazbayev Aug 07 '22 at 01:48
  • If it did work, I think it's pretty likely that something just caused the dataframe to be computed. So if you're wondering whether the results are correct, I think it's likely that they are. But if your definition of "can" is whether you can use scikit-learn on actually larger-than-memory data without computing it, I think the answer is no. – Michael Delgado Aug 07 '22 at 02:11
  • @SultanOrazbayev, I also thought that this could be happening. Maybe I am blind, but I just can't see where. I have now updated the question with some of the code. – Thiago Aug 07 '22 at 12:55
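
A minimal sketch of the mechanism the comments suspect, assuming that the estimator converts its input with numpy.asarray during validation and that the lazy object exposes NumPy's `__array__` hook. LazyTable and fit_like_estimator are hypothetical stand-ins written for illustration, not dask or scikit-learn code. The point is only that such a conversion can trigger a full compute even though user code never calls compute() explicitly.

```python
import numpy as np


class LazyTable:
    """Hypothetical stand-in for a lazy collection such as a dask dataframe."""

    def __init__(self, loader):
        self._loader = loader      # callable that produces the real data
        self.compute_calls = 0     # counts how often the data were materialized

    def compute(self):
        # Explicit materialization, analogous to calling compute() by hand.
        self.compute_calls += 1
        return self._loader()

    def __array__(self, dtype=None, copy=None):
        # NumPy invokes this hook when the object is passed to np.asarray,
        # so the data are materialized without an explicit compute() call.
        data = np.asarray(self.compute())
        return data if dtype is None else data.astype(dtype)


def fit_like_estimator(X):
    """Toy substitute for an estimator that validates its input first."""
    X = np.asarray(X, dtype=float)   # implicit materialization happens here
    return X.mean(axis=0)


lazy = LazyTable(lambda: [[1.0, 2.0], [3.0, 4.0]])
column_means = fit_like_estimator(lazy)
print(type(column_means).__name__, lazy.compute_calls)
```

If the same thing happens with a real dask dataframe, the whole dataset is pulled into memory at that step. The results would then match the pandas case, but nothing is computed out of core, which is consistent with the comments above.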