
I'm running an mLSTM (multiplicative LSTM) transform (based on OpenAI's mLSTM; just the transform, the model is already trained), but it takes a really long time to transform more than ~100,000 docs.

I want it to run on multiple GPUs. I've seen some examples, but I have no idea how to apply them to this mLSTM transform code.

The specific part that I want to run on multiple GPUs is:

        def transform(xs):
            tstart = time.time()
            xs = [preprocess(x) for x in xs]
            lens = np.asarray([len(x) for x in xs])
            sorted_idxs = np.argsort(lens)
            unsort_idxs = np.argsort(sorted_idxs)
            sorted_xs = [xs[i] for i in sorted_idxs]
            maxlen = np.max(lens)
            offset = 0
            n = len(xs)
            smb = np.zeros((2, n, hps.nhidden), dtype=np.float32)
            for step in range(0, ceil_round_step(maxlen, nsteps), nsteps):
                start = step
                end = step+nsteps
                xsubseq = [x[start:end] for x in sorted_xs]
                ndone = sum([x == b'' for x in xsubseq])
                offset += ndone
                xsubseq = xsubseq[ndone:]
                sorted_xs = sorted_xs[ndone:]
                nsubseq = len(xsubseq)
                xmb, mmb = batch_pad(xsubseq, nsubseq, nsteps)
                for batch in range(0, nsubseq, nbatch):
                    start = batch
                    end = batch+nbatch
                    batch_smb = seq_rep(
                        xmb[start:end], mmb[start:end],
                        smb[:, offset+start:offset+end, :])
                    smb[:, offset+start:offset+end, :] = batch_smb
            features = smb[0, unsort_idxs, :]
            print('%0.3f seconds to transform %d examples' %
                  (time.time() - tstart, n))
            return features

This is just a snippet of the full code (I don't think it's OK to copy the entire code here).

1 Answer


The part you're referring to is not the place where the computation is split across GPUs; it only prepares the data (on the CPU!) and runs the session.

The correct place is the one that defines the computational graph, e.g. the `mlstm` method. There are many ways to split a graph; for example, you can place the LSTM cells for different steps on different GPUs, so that the input sequence can be processed in parallel:

def mlstm(inputs, c, h, M, ndim, scope='lstm', wn=False):
  [...]
  for idx, x in enumerate(inputs):
    with tf.device('/gpu:' + str(idx % GPU_COUNT)):
      m = tf.matmul(x, wmx) * tf.matmul(h, wmh)
      z = tf.matmul(x, wx) + tf.matmul(m, wh) + b
      [...]
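
For reference, here is a self-contained sketch of the same round-robin placement pattern on a toy graph (TensorFlow 1.x). `GPU_COUNT`, the tensor shapes and the chain of matmuls are made up for illustration and are not part of the original mLSTM code:

import tensorflow as tf

GPU_COUNT = 2  # assumed number of available GPUs

x = tf.random_normal([64, 128])   # stand-in for one input batch
w = tf.random_normal([128, 128])  # stand-in for shared weights

outputs = []
for idx in range(8):  # stands in for the per-timestep loop over `inputs`
  with tf.device('/gpu:%d' % (idx % GPU_COUNT)):
    x = tf.matmul(x, w)           # this step's ops are pinned to one GPU
    outputs.append(x)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=True)) as sess:
  sess.run(outputs[-1])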

By the way, there is a useful config option in TensorFlow, `log_device_placement`, that helps you see where each op is placed. Here's an example:

import tensorflow as tf

# Creates a graph.
with tf.device('/gpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='b')
  c = tf.add(a, b)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
  # Prints the following:
  # Device mapping:
  # /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: <GPU name>, pci bus id: 0000:01:00.0, compute capability: 6.1
  # Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
  # b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
  # a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
  print(sess.run(c))
  • This way it will just duplicate the model, no? I want part of it to be trained on one GPU and the other part on the other GPU (in general, there may be more than two GPUs). – Lior Magen Dec 07 '17 at 07:12
  • The graph doesn't change with the introduction of `tf.device` placement. Also note `GPU_COUNT` in the snippet above. – Maxim Dec 07 '17 at 07:14
  • OK got it. Just to understand your solution better, that way the model will run in parallel? Part of the data will be transformed on one GPU and the other part on the second GPU? Because I know that for multiprocessing, for example, you need to first split your data and send each part of it to a different CPU. Will adding this line do all that? Or is there something I'm missing here? – Lior Magen Dec 07 '17 at 07:45
  • In general, there is [data parallelism and model parallelism](https://www.hackingnote.com/en/data-science-in-practice/large-scale/). What you describe sounds like data parallelism; my solution is about model parallelism: a single instance of the model is split across multiple nodes, allowing larger models, ones which may not necessarily fit in the memory of a single node, to be trained. – Maxim Dec 07 '17 at 09:10
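
For completeness, here is a rough sketch of the data-parallel approach discussed in the comments above: each GPU gets a copy of the same ops and a different slice of the batch. `GPU_COUNT` and the toy graph are assumptions for illustration, not part of the mLSTM code:

import tensorflow as tf

GPU_COUNT = 2
batch = tf.random_normal([128, 64])  # stand-in for a batch of inputs
w = tf.random_normal([64, 64])       # stand-in for shared weights

splits = tf.split(batch, GPU_COUNT, axis=0)  # split the batch across GPUs
outputs = []
for i, part in enumerate(splits):
  with tf.device('/gpu:%d' % i):
    outputs.append(tf.matmul(part, w))       # same op, different data slice
result = tf.concat(outputs, axis=0)          # gather the per-GPU results

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
  print(sess.run(result).shape)              # (128, 64)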