
I'm using Python's Gensim to train a Doc2Vec model. Is there any way to distribute this training on AWS (S3)? Thank you in advance.

Regina

1 Answer


Gensim's Doc2Vec is not designed to distribute training over multiple machines. Adapting its initial bulk training to do that would be a significant and complex project.

Are you sure your dataset and goals require such distribution? You can get a lot done on a single machine with many cores & 128GB+ RAM.

Note that you can also train a Doc2Vec model on a smaller representative dataset, then use its .infer_vector() method on the frozen model to calculate doc-vectors for any number of additional texts. Those frozen models can be spun up on multiple machines – allowing arbitrarily-distributed calculation of doc-vectors. (That would be far easier than distributing initial training.)

gojomo
  • How can this be done? I understand I can use `.infer_vector()`, and most probably I will need to do it this way, since my data is very big and it is impractical to retrain Doc2Vec every time new data enters the system. The problem is that my data arrives as a pyspark.sql.dataframe.DataFrame, and to allow inference I need the TaggedDocument format to use `.infer_vector()`. When I use `df.select("text").rdd.flatMap(lambda r: r).collect()` on such a big dataset, it gets stuck for a long time. How can I do this effectively, and how can I allow arbitrarily-distributed calculation? – Regina Jun 16 '17 at 02:49
  • As noted above, gensim's Doc2Vec is not designed for multi-machine or multi-process training – so you may want to drop out of Spark, write all the docs you'll be using for the single-process training to a scratch file, do the training outside Spark, then make the (frozen, read-only) trained model available for later Spark steps. – gojomo Jun 16 '17 at 16:41