I have a Dask Dataframe with the following content:

    X_trn                                               y_trn
0   java repeat task every random seconds p m alre...   LQ_CLOSE
1   are java optionals immutable p d like to under...   HQ
2   text overlay image with darkened opacity react...   HQ
3   ternary operator in swift is so picky p questi...   HQ
4   hide show fab with scale animation p m using c...   HQ

I am trying to use CountVectorizer from the dask-ml library. When I pass my X_trn column to fit_transform, I get the ValueError "Cannot infer dataframe metadata with a dask.delayed argument".

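from dask_ml.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# training['X_trn'] is a Dask Series here, which is what triggers the error
countMatrix = vectorizer.fit_transform(training['X_trn'])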
mendy

1 Answer

This answer will probably come too late for the original author, but it may still help others. The answer is actually in the documentation; I also overlooked it at first:

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings).

This apparently innocent sentence is your problem. You are passing a dask.dataframe column and not a dask.bag.Bag of documents. You can build the bag like this:

import dask.bag as db
corpus = db.from_sequence(training['X_trn'], npartitions=2)

And then, you can pass it to the vectorizer as you were doing:

 X = vectorizer.fit_transform(corpus)
G. Macia