I ran LassoCV from scikit-learn using columns of a dask dataframe (loaded from parquet, if that makes a difference) as the data in the regression. I did not call compute() beforehand, so as far as I know the data really were a dask dataframe. The function appears to have worked exactly as it does when I pass it a pandas dataframe.
My question is: can we really pass dask dataframes directly to LassoCV? I ask because I could not find any documentation of this, so I am not sure whether the output of the function is correct or whether this is just a bug.
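One possible explanation, which I have not confirmed for dask: scikit-learn validates its inputs by converting them to NumPy arrays, and NumPy will convert any object that implements `__array__`. The sketch below only illustrates that mechanism. `LazyTable` is a hypothetical stand-in I made up for a lazy collection; it is not a dask class, and I am assuming dask dataframes behave similarly.

```python
import numpy as np


class LazyTable:
    """Hypothetical stand-in for a lazy collection (not a dask class)."""

    def __init__(self, rows):
        self._rows = rows
        self.materialized = 0  # counts how often the data were pulled into memory

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when it needs a concrete array.
        self.materialized += 1
        return np.array(self._rows, dtype=dtype)


lazy = LazyTable([[1.0, 2.0], [3.0, 4.0]])
arr = np.asarray(lazy)  # triggers __array__, so the data are now in memory
print(type(arr).__name__, arr.shape)
```

If this is what happens with a dask dataframe, the fit would be correct but the whole table would be loaded into memory first, which matters for data that do not fit in RAM.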
EDIT: Below is the relevant part of the code. The full thing happens in two functions with many side calculations. I am only showing the steps that involve df0 (the dask dataframe given to the main function), up to the output.
df0 = dd.read_parquet(file)
if df0['date'].compute().is_monotonic:
    ...
alldates = (df0[df0['date'] >= start]['date'].unique()).compute()
...
in_sample = df0[df0['date'] < date]
run_lasso(df=in_sample, ...)

def run_lasso(df, ...):
    model = linear_model.LassoCV(cv=5,
                                 max_iter=1000000,
                                 precompute=True,
                                 n_jobs=1)
    df = df.dropna(subset=[y] + xUniverse)
    scaler = StandardScaler()
    X_lasso = scaler.fit_transform(df[xUniverse])
    y_lasso = df[y]
    model.fit(X_lasso, y_lasso)
    signif_coefs = list(df[xUniverse].columns[model.coef_ != 0])
    return signif_coefs
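
To make the last step of run_lasso concrete: it keeps the predictors whose fitted coefficient is nonzero by applying a boolean mask to the column index. Here is a minimal sketch of that step alone. The column names and coefficient values are invented for illustration and stand in for df[xUniverse].columns and model.coef_.

```python
import numpy as np
import pandas as pd

# Hypothetical predictor names and fitted coefficients (stand-ins for
# df[xUniverse].columns and model.coef_).
columns = pd.Index(["x1", "x2", "x3", "x4"])
coef = np.array([0.7, 0.0, -1.2, 0.0])

# Boolean mask: True where the lasso kept the predictor.
mask = coef != 0
signif_coefs = list(columns[mask])
print(signif_coefs)
```

As far as I know, the columns attribute of a dask dataframe is an ordinary pandas Index, so this step only uses metadata and does not touch the underlying data.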