I have a huge data frame that doesn't fit into memory. Thus I access it in Python via dask (distributed).
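
For reference, I load the frame roughly like this (the parquet path is just illustrative; any out-of-core source works the same way):

import dask.dataframe as dd

# lazily point dask at the on-disk data; nothing is read into memory yet
ddf = dd.read_parquet("data/huge_frame/*.parquet")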

I want to train a Word2Vec/Doc2Vec model with the package gensim on the entries of one column of the data frame, which is why I built an iterator like the one in this question; a sketch of mine follows.
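
The linked question isn't reproduced here, but my iterator looks roughly like this (the tokenizer and names are illustrative):

from gensim.utils import simple_preprocess

class DaskColumnCorpus:
    """Stream one column of a dask dataframe, one partition at a time,
    so only a single partition is ever held in memory."""

    def __init__(self, ddf, column):
        self.ddf = ddf
        self.column = column

    def __iter__(self):
        # gensim iterates over the corpus several times, so this must be
        # a restartable iterable rather than a one-shot generator
        for i in range(self.ddf.npartitions):
            part = self.ddf.get_partition(i)[self.column].compute()
            for text in part:
                yield simple_preprocess(text)  # list of tokens per document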

Now, gensim trains using multiple worker threads, whose number I need to specify, and dask likewise lets me choose how many cores to use. So far I have given all available cores to dask and the same number to gensim, as in the sketch below. My reasoning was that fetching data and training on it are mutually exclusive tasks that cannot run at the same time, so gensim and dask shouldn't fight over the cores.
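
Concretely, the allocation looks roughly like this (a sketch; `n_cores`, the column name, and the corpus class from above are illustrative):

import os

from dask.distributed import Client
from gensim.models import Word2Vec

n_cores = os.cpu_count()
client = Client(n_workers=n_cores)         # hand all cores to dask
corpus = DaskColumnCorpus(ddf, "text")     # iterator sketched above
model = Word2Vec(corpus, workers=n_cores)  # same core count for gensim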

Indeed, there are no error messages, but training still seems quite slow, and I suspect there's a better way to distribute the work. Does anyone have experience with this?

Edgar
  • Because there's no facility in gensim for integrating with `dask`, or more generally for recombining models that were trained in a separate, distributed fashion, I would try to eliminate both `dask` and dataframes as factors in the `gensim` step. Get just the data you need for the `gensim` step into a streamable, pre-tokenized file (however large), then stream that to Word2Vec/Doc2Vec on a single many-core machine. – gojomo Jan 14 '20 at 21:25

1 Answer

Combining two libraries that both try to use multi-threaded parallelism can be counter-productive: with too many active threads, they contend for the same cores and slow each other down. I recommend setting one of the two libraries to use only a single thread.

For Dask, you can do this with the following:

import dask

dask.config.set(scheduler="single-threaded")

https://docs.dask.org/en/latest/scheduling.html
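
The mirror-image choice is to leave Dask threaded and restrict gensim instead, e.g. via the `workers` parameter of Word2Vec/Doc2Vec:

from gensim.models import Word2Vec

# a single gensim training thread; dask keeps its default parallelism
model = Word2Vec(corpus, workers=1)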

MRocklin