
I am experimenting with different batch sizes (N) and sequence lengths (L) for an LSTM. Clearly, the number of computations required for N=10, L=100 and for N=100, L=10 should be the same if there is no parallelization. However, I observe that the larger batch size leads to 2+ times faster computation on the CPU.

Also, I see no difference in speed between a hidden size (of the LSTM) of 10 and a hidden size of 100. A hidden size of 100 is ~2.5 times faster than 1000, however.
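
As a side check (not part of the LSTM code below; the sizes and repetition count are just illustrative), a standalone sketch that times the core LSTM-sized matrix multiply at different hidden sizes suggests how much of the cost is fixed per-op overhead rather than arithmetic:

import time
import numpy as np

# Time a [1, d] x [d, 4*d] multiply (the shape of the LSTM gate
# pre-activation matmul) for a few hidden sizes, repeated once per
# "time step", to see when the arithmetic starts to dominate overhead.
for d in [10, 100, 1000]:
    x = np.random.randn(1, d).astype('float32')
    W = np.random.randn(d, 4 * d).astype('float32')
    t0 = time.time()
    for _ in range(500):
        _ = x.dot(W)
    print(d, time.time() - t0)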

These observations lead me to believe that TensorFlow also benefits from some parallelization on the CPU. Is this true?

The code is as follows:

import tensorflow as tf
import numpy as np
import time

from tensorflow.contrib.rnn import BasicLSTMCell

N = 1     # batch size
L = 500   # sequence length
d = 10    # embedding / hidden size
k = 1     # number of timed runs
V = 1000  # vocabulary size

inputs = tf.placeholder('int64', [N, L])
emb_mat = tf.get_variable('emb_mat', shape=[V, d])
x = tf.nn.embedding_lookup(emb_mat, inputs)
lstm_cell = BasicLSTMCell(d)
a, (c, h) = tf.nn.dynamic_rnn(lstm_cell, x, dtype='float')
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)

    inputs_val = np.zeros([N, L], dtype='int64')
    t0 = time.time()
    for _ in range(k):
        out = sess.run(h, feed_dict={inputs: inputs_val})
    t1 = time.time()
    print(t1 - t0)
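
One way I can think of to check whether thread-level parallelism is responsible is to pin the session to a single thread and compare timings. A minimal sketch, reusing the graph built above and assuming TensorFlow 1.x's ConfigProto thread options:

# Sketch: restrict TensorFlow to one intra-op and one inter-op thread,
# so any speedup that came from CPU parallelism should disappear.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

with tf.Session(config=config) as sess:
    sess.run(init_op)
    inputs_val = np.zeros([N, L], dtype='int64')
    t0 = time.time()
    for _ in range(k):
        out = sess.run(h, feed_dict={inputs: inputs_val})
    t1 = time.time()
    print('single-threaded:', t1 - t0)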
Minjoon Seo
  • I think this happens because, when you work with a CPU (sequentially), the most significant cost is starting and re-running each operation. Hence, the more batches you have, the longer it takes. But I may be wrong... – Alon Alexander May 05 '17 at 18:30
  • TensorFlow can be compiled with SIMD options (SSE/AVX) which allow the CPU to perform a single operation on multiple data points simultaneously. – BHawk May 05 '17 at 18:36
  • In my experience it appears that TF parallelizes on CPU as well. At least when I run it on Linux and look at `top` in the terminal (or run it on Windows and look at the task manager), I am using all the cores available to me for training. – Engineero May 05 '17 at 19:36

0 Answers