
A similar question has been asked here (how to train multiple neural networks simultaneously), but the answers were specific to Caffe. Here is my specific question:

A friend of mine has designed an RNN for a certain problem using Theano and TensorFlow. It has 14 input nodes, 2 hidden layers with 7 nodes each, and one output node. We have around 30,000 such RNNs that need to be trained. I am a software engineer with very little exposure to machine learning. What I need to do is speed up the training process of these RNNs.

Looking at the problem from a CS perspective, I don't think anything can be done to speed up the training of a single RNN; running such a small RNN on a GPU makes no sense. Instead, we can get a speedup by batching the RNNs, say 1000 at a time, and sending them to the GPU together. The problem is SIMD in nature: each RNN is identical, but it has to train on a different data set.
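Conceptually, here is the kind of thing I have in mind: give every weight matrix a leading "model" axis, so that one vectorized operation advances all the RNNs at once. A rough NumPy sketch of a single layer (the shapes and names are just mine for illustration, not real training code):

import numpy as np

N, IN, H = 1000, 14, 7     # models per batch, inputs, hidden units

# one parameter set per model, stacked along a leading axis
W_in = 0.1 * np.random.randn(N, IN + H, H)   # input + recurrent weights
W_out = 0.1 * np.random.randn(N, H, 1)       # hidden-to-output weights
h = np.zeros((N, H))                         # one hidden state per model

def step(x, h):
    # x: (N, IN), a different input row for each model
    z = np.concatenate([x, h], axis=1)                 # (N, IN + H)
    h_new = np.tanh(np.einsum('ni,nih->nh', z, W_in))  # all models at once
    y = np.einsum('nh,nho->no', h_new, W_out)          # (N, 1) predictions
    return y, h_new

y, h = step(np.random.randn(N, IN), h)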

Can someone please explain how this could be done using Theano or TensorFlow?

Here is the code for a single model:

import pandas as pd

# b holds the raw data rows (defined elsewhere)
df = pd.DataFrame(b, columns=['A','B','C','D','E','F','G','H','I','J',
                              'K','L','M','N','O','P','Q','R','S','T'])

# group by the series keys and sort each group chronologically by 'S'
ds = df.groupby(['A','Q','R']).apply(lambda h: h.sort_values('S')).values.tolist()
import math

# Derive an extra feature: a damped version of the previous day's sale,
# reset to 0 at the start of each new series.
stationary_id = 0
sale_from_previous_day = []
for i in xrange(0, len(ds)):
    if ds[i][0] != stationary_id:
        # first row of a new series: no previous day available
        stationary_id = ds[i][0]
        sale_from_previous_day.append(0)
    else:
        if float(ds[i-1][19]) == 0:
            sale_from_previous_day.append(0)
        else:
            sale_from_previous_day.append(
                math.log(1 + float(ds[i-1][19])) / float(ds[i-1][19]))

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn_cell
import time

# create a placeholder for input layer
input_layer = tf.placeholder(tf.float32, [1, 14])

# no. of neurons & layers
num_hidden = 7
num_layers = 2

# Construct Multilayer RNN
network = rnn_cell.BasicRNNCell(num_hidden)
network1 = rnn_cell.MultiRNNCell([network] * num_layers)

# The hidden state as a Variable initialized to zeroes

state1 = tf.Variable(tf.zeros([1, network1.state_size]))

# Connect the input layer and initial hidden state to the rnn cell
output1, state_output1 = network1(input_layer, state1)

# update the state
update_op1 = state1.assign(state_output1)

# hidden-to-output weights
output_W1 = tf.Variable(tf.truncated_normal([num_hidden, 1]))

# output bias
output_b1 = tf.Variable(tf.zeros([1]))

# the output linear layer returns the predicted output
final_output = tf.matmul(output1, output_W1) + output_b1

#Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, 1])

# squared error between prediction and target
error = tf.pow(tf.sub(final_output, correct_output), 2)

# Adam optimizer
train_step = tf.train.AdamOptimizer(0.0006).minimize(error)

##session
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
               intra_op_parallelism_threads=1))
#Initialize all Variables
sess.run(tf.initialize_all_variables())


m1 = int(round(time.time() * 1000))  # start time in milliseconds

for epoch in range(0, 7):
    er = 0
    pon = 0
    for i in range(len(ds)):
        # 13 raw feature columns plus the derived previous-day feature
        # give the 14 inputs
        a = np.array([[ds[i][1], ds[i][2], ds[i][3], ds[i][4], ds[i][5],
                       ds[i][6], ds[i][7], ds[i][8], ds[i][9], ds[i][10],
                       ds[i][11], ds[i][12], ds[i][14],
                       sale_from_previous_day[i]]])
        b = np.array([[ds[i][19]]])
        _, _, network_output = sess.run(
            [update_op1, train_step, final_output],
            feed_dict={input_layer: a, correct_output: b})

        er += 0.5 * (b[0][0] - network_output[0][0]) ** 2
        pon += 1
        print er / pon

print (int(round(time.time() * 1000)) - m1) / 1000.0
  • That example for Caffe just merges the input layer on both nets and renames nodes to prevent name clashes, which is probably trivial to do in Theano and TensorFlow. This saves duplicate copying of identical minibatches to GPU. However, in your case, you have different training sets, so that technique won't help you. – Ken Y-N Jul 14 '16 at 03:20
  • Your problem does seem small (you don't say how big the training sets are), but the DL engines do tend to use every last drop of power the GPUs provide (check `watch nvidia-smi` if on Linux), and you seem to worry that each iteration might only take a couple of milliseconds. Perhaps if you ran multiple threads, while one task was running on the GPU, another could be copying data in, another summing up the outputs, etc.? Or just turn off the GPU and run multiple CPU threads? – Ken Y-N Jul 14 '16 at 03:25

1 Answer


I believe what you want to do is make your 1000 separate models look like one model for training purposes. The simple models all have the same architecture and differ only in their parameters, which are learned, so they really differ only in the sequences of training examples they see. It should therefore be possible to define a compound model in which each layer is 1000 copies of the simple model's layer, with the inter-layer connectivity defined so that each cell is connected only to the cells in adjacent layers that belong to the same simple model. The compound model should execute more efficiently on a GPU.

Then you will also need to figure out how to configure the input layer to feed the correct inputs to each copy in parallel, presumably by concatenating them into a 1000-wide batch, along the lines of the sketch below.
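For instance (a minimal single-layer sketch in the TF 0.x-era API used elsewhere in this post; all shapes and names are illustrative, not a drop-in solution):

import tensorflow as tf

N, IN, H = 1000, 14, 7   # models, inputs per model, hidden units

# one input row and one target per model per step
inputs = tf.placeholder(tf.float32, [N, 1, IN])
targets = tf.placeholder(tf.float32, [N, 1, 1])

# a leading "model" axis keeps the 1000 parameter sets separate
# while sharing a single graph
W1 = tf.Variable(tf.truncated_normal([N, IN + H, H], stddev=0.1))
Wo = tf.Variable(tf.truncated_normal([N, H, 1], stddev=0.1))
state = tf.Variable(tf.zeros([N, 1, H]), trainable=False)

z = tf.concat(2, [inputs, state])            # (N, 1, IN + H)
new_state = tf.tanh(tf.batch_matmul(z, W1))  # advance all models at once
update_state = state.assign(new_state)
pred = tf.batch_matmul(new_state, Wo)        # (N, 1, 1)

# summing the per-model errors decouples the models: each model's
# gradient only touches its own slice of W1 and Wo
error = tf.reduce_sum(tf.pow(tf.sub(pred, targets), 2))
train_step = tf.train.AdamOptimizer(0.0006).minimize(error)

A single sess.run([update_state, train_step], feed_dict=...) then takes one training step for all 1000 models at once, since the gradient of the summed error with respect to each model's parameter slice is independent of the other models.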

Does this make sense? If you post the TensorFlow source for your simple model, someone may be able to give you more specific advice on how to construct such a compound model.

(It's also possible that you are trying to solve the wrong problem: instead of training 30,000 separate tiny RNNs, you could train one somewhat larger network with some additional configuration inputs.)
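One common way to provide such configuration inputs would be a learned per-series embedding concatenated onto the 14 features. A sketch, with an arbitrary embedding width, reusing the question's input_layer:

# hypothetical: one shared RNN, told which of the 30k series an
# example comes from via a learned embedding
series_id = tf.placeholder(tf.int32, [1])
embeddings = tf.Variable(tf.truncated_normal([30000, 4], stddev=0.1))
series_vec = tf.nn.embedding_lookup(embeddings, series_id)  # (1, 4)
wide_input = tf.concat(1, [input_layer, series_vec])        # (1, 18)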

Paul Tucker
  • Updated the post with the code for a single model. I think we do need the tiny RNNs, and many of them. Thank you! – user1274878 Jul 14 '16 at 22:25
  • Is there any example file for this? It would be really interesting, also to know how many such networks one could train simultaneously on one GPU. – silgon Nov 22 '17 at 18:51