
I've constructed an LSTM recurrent neural network using lasagne that is loosely based on the architecture in this blog post. My input is a text file with around 1,000,000 sentences and a vocabulary of 2,000 word tokens. Normally, when I construct networks for image recognition, my input layer looks something like the following:

l_in = nn.layers.InputLayer((32, 3, 128, 128))

(where the dimensions are batch size, channels, height, and width), which is convenient because all the images are the same size, so I can process them in batches. Since each instance in my LSTM network has a varying sentence length, I have an input layer that looks like the following:

l_in = nn.layers.InputLayer((None, None, 2000))
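
Concretely, a single sentence of length L ends up as a (1, L, 2000) array, with the 2,000-dimensional feature axis being a one-hot encoding over the vocabulary. A rough sketch of the encoding (the word_ids below are just example token indices):

import numpy as np

VOCAB_SIZE = 2000
word_ids = [5, 42, 7]          # example token indices for one sentence

# one-hot encode into shape (batch_size=1, seq_len, vocab_size) to match
# the (None, None, 2000) input layer above
X = np.zeros((1, len(word_ids), VOCAB_SIZE), dtype='float32')
X[0, np.arange(len(word_ids)), word_ids] = 1.0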

As described in the above-referenced blog post:

Masks:
Because not all sequences in each minibatch will always have the same length, all recurrent layers in lasagne accept a separate mask input which has shape (batch_size, n_time_steps), which is populated such that mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When no mask is provided, it is assumed that all sequences in the minibatch are of length n_time_steps.
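
For reference, such a mask can be built directly from the per-sentence lengths before padding; a minimal numpy sketch (the lengths list is hypothetical):

import numpy as np

lengths = [7, 3, 5]            # hypothetical lengths of the sequences in one minibatch
n_time_steps = max(lengths)

# mask[i, j] = 1 for positions inside sequence i, 0 for the padded tail
mask = np.zeros((len(lengths), n_time_steps), dtype='float32')
for i, l in enumerate(lengths):
    mask[i, :l] = 1.0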

My question is: Is there a way to process this type of network in mini-batches without using a mask?


Here is a simplified version of my network.

# -*- coding: utf-8 -*-

import theano
import theano.tensor as T
import lasagne as nn

softmax = nn.nonlinearities.softmax

def build_model():
    # input shape: (batch_size, n_time_steps, vocab_size)
    l_in  = nn.layers.InputLayer((None, None, 2000))
    lstm  = nn.layers.LSTMLayer(l_in, 4096, grad_clipping=5)
    # take the single sequence in the batch (indices=0 along axis=0),
    # leaving one row of LSTM features per time step
    rs    = nn.layers.SliceLayer(lstm, 0, 0)
    # per-time-step softmax over the 2000-word vocabulary
    dense = nn.layers.DenseLayer(rs, num_units=2000, nonlinearity=softmax)
    return l_in, dense

model = build_model()
l_in, l_out = model

all_params = nn.layers.get_all_params(l_out)
target_var = T.ivector("target_output")

output = nn.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(output, target_var).sum()
updates = nn.updates.adagrad(loss, all_params, 0.005)

train = theano.function([l_in.input_var, target_var], loss, updates=updates)

From there I have a generator that spits out (X, y) pairs, and I call train(X, y), updating the parameters on each iteration. What I want to do instead is take N training steps and then update the parameters with the average gradient.

To do this, I tried creating a compute_gradient function:

gradient = theano.grad(loss, all_params)

compute_gradient = theano.function(
    [l_in.input_var, target_var],
    outputs=gradient
)

and then looping over several training instances to create a "batch", collecting the gradient calculations in a list:

grads = []
for _ in xrange(1024):
    X, y = train_gen.next()  # generator for producing training data
    grads.append(compute_gradient(X, y))

This produces a list of lists:

>>> grads
[[<CudaNdarray at 0x7f83b5ff6d70>,
<CudaNdarray at 0x7f83b5ff69f0>,
<CudaNdarray at 0x7f83b5ff6270>,
<CudaNdarray at 0x7f83b5fc05f0>],
[<CudaNdarray at 0x7f83b5ff66f0>,
<CudaNdarray at 0x7f83b5ff6730>,
<CudaNdarray at 0x7f83b5ff6b70>,
<CudaNdarray at 0x7f83b5ff64f0>] ...

From here I would need to take the mean of the gradients for each parameter and then update the model parameters. Is it possible to do this in pieces like this, or does the gradient calculation/parameter update need to happen all in one theano function?

Thanks.

o-90
  • Wouldn't you, at compile time, need to define a theano function that takes batch_size gradients as inputs, takes the mean, and applies the changes to the shared-value params? – user2255757 Feb 05 '16 at 11:04
  • @user2255757 yes, that sounds like what I am after. I'm just not sure how to go about doing that with a list of symbolic CudaNdarray instances. If they were numpy arrays with actual values in them, I'd just do `map(np.mean, zip(*grads))` and then update the params, but they aren't, so I'm not sure how to proceed. – o-90 Feb 05 '16 at 15:34
  • I updated my answer in regard to your update of the question, hope it helps – user2255757 Feb 05 '16 at 21:36
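
For reference, the host-side averaging mentioned in the comment above could look roughly like this (a sketch only; it assumes each CudaNdarray can be copied back to a numpy array with np.asarray):

import numpy as np

# grads is the list of lists collected above: one inner list of per-parameter
# gradient arrays for each training instance
mean_grads = [np.mean([np.asarray(g) for g in per_param], axis=0)
              for per_param in zip(*grads)]

The averaged values would then still need to be fed back into a separate update function, such as the one described in the answer below.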

1 Answer


NOTE: this is a solution, but by no means do I have enough experience to verify it is the best one, and the code is just a rough example.

You need two theano functions. The first is the gradient one you already seem to have, judging from the information provided in your question.

So after computing the batched gradients, you want to immediately feed them as input arguments into another theano function dedicated to updating the shared variables. For this you need to specify the expected batch size at compile time. You could do something like this (for simplicity I will assume you have a list variable where all your params are stored):

params = all_params                        # list of params you wish to update (from the question)
BATCH_SIZE = 1024                          # size of the expected training batch

# One placeholder per (training instance, parameter), flattened so the whole
# batch of gradients can be fed into a single theano function. param.type()
# creates a symbolic variable with the same dtype/ndim as each parameter.
G = [param.type() for _ in range(BATCH_SIZE) for param in params]

n_params = len(params)

# start with the gradients from the first instance in the batch ...
grad_sums = [G[i] for i in range(n_params)]

# ... then add the gradients for the same parameter from every other instance
for i in range(n_params):
    for j in range(1, BATCH_SIZE):
        grad_sums[i] = grad_sums[i] + G[j * n_params + i]

# build the (shared_variable, new_value) tuples for theano.function's updates
# argument: a plain SGD step with the mean gradient (learning rate is arbitrary here)
LEARNING_RATE = 0.005
updates = [(params[i], params[i] - LEARNING_RATE * grad_sums[i] / BATCH_SIZE)
           for i in range(n_params)]

update = theano.function(G, updates=updates)

This way theano takes the mean of the gradients and updates the params as usual.

I don't know whether you need to flatten the inputs as I did, but probably.
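
Assuming the gradients are collected with compute_gradient as in the question, calling the update function could look like this (a sketch; the lists are flattened in the same instance-major order used to build G above):

grads = []
for _ in xrange(BATCH_SIZE):
    X, y = train_gen.next()  # generator for producing training data
    grads.append(compute_gradient(X, y))

# flatten the list of lists in the same order the G placeholders were created
flat_grads = [g for per_instance in grads for g in per_instance]
update(*flat_grads)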

EDIT: Gathering from how you edited your question, it seems important that the batch size can vary. In that case you could add two theano functions to your existing setup:

  1. The first theano function takes two lists of per-parameter gradients and returns their element-wise sum. You could apply this function with Python's reduce() to get the sum over the whole batch of gradients.
  2. The second theano function takes those summed parameter gradients and a scalar (the batch size) as input, and is therefore able to update the network params with the mean of the summed gradients (see the sketch below).
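
A rough sketch of those two functions (the names, the plain SGD step, and the learning rate are my own assumptions, not from the original answer; params is the list of shared parameters as above):

# 1) element-wise sum of two lists of per-parameter gradients
A = [p.type() for p in params]
B = [p.type() for p in params]
add_grads = theano.function(A + B, [a + b for a, b in zip(A, B)])

# 2) apply the summed gradients, scaled by the (variable) batch size,
#    as a plain SGD step
summed = [p.type() for p in params]
n = T.scalar('n')
LEARNING_RATE = 0.005  # assumed value, matching the question's adagrad step
sgd_updates = [(p, p - LEARNING_RATE * s / n) for p, s in zip(params, summed)]
apply_mean = theano.function(summed + [n], updates=sgd_updates)

# usage, where batch_grads is a list of per-instance gradient lists:
#   total = reduce(lambda a, b: add_grads(*(a + b)), batch_grads)
#   apply_mean(*(total + [float(len(batch_grads))]))
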
user2255757
  • Sorry, been super-busy @ work. Thank you for the response; I'll look over it this week and get back to you. – o-90 Feb 16 '16 at 16:48
  • If you look at most online resources, SGD doesn't sum the updates and take the mean; they are just applied one by one. It seems no different from normal gradient descent, other than that you are immediately feeding in multiple training cases. – user2255757 Feb 17 '16 at 17:19