Yes, it is possible in TensorFlow out of the box. The trick is to use variable partitioning, e.g. tf.fixed_size_partitioner, together with parameter server replication via tf.train.replica_device_setter, to split the variable across several machines. Here's what it looks like in code:
import tensorflow as tf

with tf.device(tf.train.replica_device_setter(ps_tasks=3)):
    # The partitioner splits the variable into 3 shards along dimension 0.
    embedding = tf.get_variable("embedding", [1000000000, 20],
                                partitioner=tf.fixed_size_partitioner(3))
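Downstream code can then treat embedding like any ordinary variable. As a minimal sketch (the ids placeholder is purely illustrative, and partition_strategy="div" is chosen here on the assumption that it matches the contiguous row split produced by tf.fixed_size_partitioner):

# Hypothetical lookup over the partitioned embedding matrix.
ids = tf.placeholder(tf.int64, shape=[None])
# "div" assumes contiguous row sharding along dimension 0, as fixed_size_partitioner produces.
vectors = tf.nn.embedding_lookup(embedding, ids, partition_strategy="div")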
The best part is that these changes are very local: the rest of the training code doesn't need to change at all. At runtime, however, there is a big difference: embedding will be chunked into 3 shards, each pinned to a different ps task, which you can run on a separate machine. See also this relevant question.
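In case it helps, below is a minimal sketch of how such a cluster could be wired up with tf.train.ClusterSpec and tf.train.Server. The host names, ports and task layout are placeholders, not part of the original setup; each ps process would run on its own machine:

import tensorflow as tf

# Hypothetical cluster definition: host names and ports are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222",
           "ps2.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# On each ps machine (task_index 0, 1 or 2), run:
#   tf.train.Server(cluster, job_name="ps", task_index=...).join()

# On the worker machine:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    embedding = tf.get_variable("embedding", [1000000000, 20],
                                partitioner=tf.fixed_size_partitioner(3))

with tf.Session(server.target) as sess:
    ...  # build and run the rest of the training graph as usual

The only difference from the snippet above is that replica_device_setter is given the cluster spec instead of a bare ps_tasks count, so it can place each shard on the corresponding ps task.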