
Hello, I am new to dask-ml. I have been trying to use dask-ml to train a logistic regression model to predict tweet sentiment. I converted a pandas DataFrame to a Dask DataFrame, then performed a train/test split, and then applied a HashingVectorizer to X_train and X_test. Running Train_X_vect.compute().shape to check the shape returns (180224, 7000), whereas y_train.compute().shape returns (180224,). Whenever I try to fit them in a logistic regression model I get an error saying "cannot add intercept to array with unknown chunk". This is my code:

import dask.dataframe as dd
from dask_ml.feature_extraction.text import HashingVectorizer
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression

# convert the pandas DataFrame to a Dask DataFrame
dask_df = dd.from_pandas(pandas_df, npartitions=4)

# split the preprocessed tweets and their sentiment labels
X_train, X_test, y_train, y_test = train_test_split(dask_df["preprocess"], dask_df["target"], random_state=42)

# hash the text into a 7000-dimensional feature space
vectorizer = HashingVectorizer(n_features=7000)
vectorizer.fit(X_train)
Train_X_vect = vectorizer.transform(X_train)
Test_X_vect = vectorizer.transform(X_test)

lr = LogisticRegression()
lr.fit(Train_X_vect, y_train)

I also tried fit_intercept=False, but then I get this error instead: "IndexError: Index dimension must be <= 2".

Could you please tell me what I am doing wrong and how I should fix it? Thank you.

Sabbir Talukdar

1 Answer


Right now LogisticRegression requires that the Dask Array passed to it has known chunk sizes (see Train_X_vect.chunks or .shape). This restriction might be lifted in the future, but in the meantime convert to known chunks with Train_X_vect.compute_chunk_sizes() prior to lr.fit.
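A minimal sketch of that fix, reusing the variable names from the question. The to_dask_array(lengths=True) step for y_train is an extra assumption on my part, in case the labels also need known chunk sizes:

# compute_chunk_sizes() fills in the unknown (nan) chunk sizes in place
Train_X_vect = vectorizer.transform(X_train)
Train_X_vect.compute_chunk_sizes()

# assumption: the labels may also need known chunk sizes, so convert the
# Dask Series to a Dask Array with computed partition lengths
y_train_arr = y_train.to_dask_array(lengths=True)

lr = LogisticRegression()
lr.fit(Train_X_vect, y_train_arr)

Note that compute_chunk_sizes() triggers a computation to find each chunk's row count, so it has some cost on large data.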

TomAugspurger