
I want to train a word2vec model on a corpus so big that the embedding matrix cannot fit in RAM.

I know there are existing solutions for parallelizing the algorithm, for example the Spark implementation, but I would like to use the TensorFlow library.

Is it possible?


2 Answers


Yes, it is possible in TensorFlow out of the box. The trick is to use variable partitioning, e.g. tf.fixed_size_partitioner, and parameter server replication via tf.train.replica_device_setter, to split the variable across several machines. Here's what it looks like in code:

import tensorflow as tf

# Shard the embedding matrix into 3 pieces, one per parameter-server task
with tf.device(tf.train.replica_device_setter(ps_tasks=3)):
  embedding = tf.get_variable("embedding", [1000000000, 20],
                              partitioner=tf.fixed_size_partitioner(3))

The best part is that these changes are very local, and for the rest of the training code they make no difference. At runtime, however, there is a big difference: embedding will be chunked into 3 shards, each pinned to a different ps task, which you can run on a separate machine. See also this relevant question.
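For completeness, here is a minimal sketch of how the partitioned variable might be wired into a full distributed job. The host addresses, vocabulary size and placeholder are made up for illustration, and the TF 1.x API is assumed:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],   # 3 parameter-server tasks
    "worker": ["worker0:2222"],
})
# on each ps machine: tf.train.Server(cluster, job_name="ps", task_index=i).join()

# on the worker:
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
  embedding = tf.get_variable("embedding", [1000000000, 20],
                              partitioner=tf.fixed_size_partitioner(3))
  word_ids = tf.placeholder(tf.int64, [None])
  # embedding_lookup is shard-aware: each lookup only contacts the ps task
  # that owns the corresponding rows of the embedding matrix
  vectors = tf.nn.embedding_lookup(embedding, word_ids)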


The authors of word2vec implemented the algorithm using an asynchronous SGD scheme called Hogwild!, so you might want to look for a TensorFlow implementation of that algorithm.

In Hogwild!, each thread takes one sample at a time and updates the weights without any synchronization with the other threads. These updates from different threads can potentially overwrite each other, leading to data races. But the Hogwild! authors show that it works well for very sparse data sets, where many samples are nearly independent because they write to mostly different indices of the model.
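To make the idea concrete, here is an illustrative-only sketch of Hogwild!-style updates using plain Python threads and NumPy. The model size, samples and gradients are made up, and this is not the actual word2vec training code:

import threading
import numpy as np

weights = np.zeros((10000, 20))          # shared model, updated in place
lr = 0.025

def sparse_sgd_update(row_ids, grads):
    # Each update touches only a few rows; with sparse data, concurrent
    # threads rarely collide on the same rows, which is why skipping
    # synchronization does little harm in practice.
    for i, g in zip(row_ids, grads):
        weights[i] -= lr * g             # unsynchronized write

def worker(rng):
    for _ in range(1000):
        rows = rng.integers(0, weights.shape[0], size=5)
        grads = rng.standard_normal((5, 20))   # stand-in for real gradients
        sparse_sgd_update(rows, grads)

threads = [threading.Thread(target=worker, args=(np.random.default_rng(s),))
           for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()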
