I have a huge data frame that doesn't fit into memory, so I access it in Python via dask (distributed).
I want to train a Word2Vec/Doc2Vec model with the gensim package on the entries of one column of that data frame, which is why I built an iterator like the one in this question.
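For reference, a minimal sketch of such an iterator could look like this (the file path, column name, and whitespace tokenization are just placeholders):

```python
import dask.dataframe as dd
from gensim.models import Word2Vec


class ColumnSentences:
    """Streams one column of a dask dataframe, one partition at a time."""

    def __init__(self, ddf, column):
        self.ddf = ddf
        self.column = column

    def __iter__(self):
        # to_delayed() yields one lazy object per partition, so only a single
        # partition is materialized in memory at any time.
        for delayed_partition in self.ddf[self.column].to_delayed():
            for text in delayed_partition.compute():
                yield str(text).split()  # naive whitespace tokenization


ddf = dd.read_parquet("data.parquet")      # placeholder data source
sentences = ColumnSentences(ddf, "text")   # placeholder column name
model = Word2Vec(sentences=sentences, workers=8)
```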
Now, gensim trains using multiple cores, whose number I need to specify, and dask likewise lets me use multiple cores. So far I have given all available cores to dask and the same number of cores to gensim. My reasoning is that fetching the data and training on it are mutually exclusive tasks that cannot run at the same time, so gensim and dask shouldn't fight over the cores.
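In code, that allocation corresponds to something like the following (simplified; `sentences` is the iterator sketched above, and the cluster setup is only illustrative):

```python
import multiprocessing
from dask.distributed import Client, LocalCluster
from gensim.models import Word2Vec

n_cores = multiprocessing.cpu_count()

# All available cores go to the dask cluster...
cluster = LocalCluster(n_workers=n_cores, threads_per_worker=1)
client = Client(cluster)

# ...and the same number of worker threads to gensim.
model = Word2Vec(sentences=sentences, workers=n_cores)
```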
With this setup there are no error messages, but training still seems quite slow, and I suspect there's a better way to distribute the work. Does anyone have experience with this?