Recently, I have been using TensorFlow to develop an NMT system. I tried to train it on multiple GPUs using data parallelism to speed it up, following the standard data-parallel approach widely used in TensorFlow. For example, to run on an 8-GPU machine: first, construct a large batch that is 8 times the size of the batch used on a single GPU; then split this large batch equally into 8 mini-batches and train each one on a different GPU; finally, collect the gradients to update the parameters. However, I find that when I use dynamic_rnn, the average time per iteration on 8 GPUs is twice as long as the time per iteration on a single GPU, even though I made sure the batch size per GPU is the same. Does anyone have a better way to speed up RNN training in TensorFlow?
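For reference, here is a minimal sketch of the data-parallel setup described above (graph-mode TF 1.x). The model body, hidden size, and vocabulary size are placeholders for illustration, not my actual NMT code, and the loss is not masked for brevity:

```python
import tensorflow as tf

NUM_GPUS = 8

def tower_loss(inputs, lengths, labels):
    # Placeholder model: one LSTM layer over the padded batch, then a projection.
    cell = tf.nn.rnn_cell.LSTMCell(512)
    outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=lengths, dtype=tf.float32)
    logits = tf.layers.dense(outputs, 32000)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# One large batch, split evenly across the GPUs.
inputs = tf.placeholder(tf.float32, [None, None, 512])
lengths = tf.placeholder(tf.int32, [None])
labels = tf.placeholder(tf.int32, [None, None])

input_splits = tf.split(inputs, NUM_GPUS, axis=0)
length_splits = tf.split(lengths, NUM_GPUS, axis=0)
label_splits = tf.split(labels, NUM_GPUS, axis=0)

optimizer = tf.train.AdamOptimizer(1e-3)
tower_grads = []
for i in range(NUM_GPUS):
    # Each tower builds the same graph on its own GPU and reuses the variables.
    with tf.device('/gpu:%d' % i), tf.variable_scope('nmt', reuse=(i > 0)):
        loss = tower_loss(input_splits[i], length_splits[i], label_splits[i])
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply a single update.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    if grads:
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)
```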
- The average iteration time is inherently longer: you have to send the data to all the GPUs, wait for all of them to finish, compute the update, and update the parameters. The difference is, *in theory*, that you'll need fewer training steps because you're training on bigger batches, so overall the training time should be reduced. – GPhilo Nov 28 '17 at 11:19
- A very interesting finding is that the average iteration time is much smaller if you use tf.static_rnn. In addition, if you do not use an RNN and run other neural models with the same data parallelism, you will find that the speed-up ratio is acceptable. So I think TensorFlow does not support RNNs well, especially tf.dynamic_rnn. – user4193910 Nov 29 '17 at 07:43
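For anyone trying the tf.static_rnn suggestion from the comment above, a minimal sketch of the swap (assuming inputs padded to a fixed maximum length; the sizes here are placeholders):

```python
import tensorflow as tf

MAX_LEN = 50  # static_rnn unrolls the graph, so it needs a fixed maximum length

cell = tf.nn.rnn_cell.LSTMCell(512)
inputs = tf.placeholder(tf.float32, [None, MAX_LEN, 512])
lengths = tf.placeholder(tf.int32, [None])

# static_rnn takes a Python list of per-timestep tensors instead of one 3-D tensor.
inputs_per_step = tf.unstack(inputs, num=MAX_LEN, axis=1)
outputs, final_state = tf.nn.static_rnn(cell, inputs_per_step,
                                        sequence_length=lengths, dtype=tf.float32)
```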