LSTMLayer produces NaN values even before training it

Question

I'm currently trying to build an LSTM network with Lasagne to predict the next step of noisy sequences. I first trained a stack of 2 LSTM layers for a while, but had to use an extremely small learning rate (1e-6) because training otherwise diverged and ultimately produced NaN values. The results were disappointing: the network produced smooth, out-of-phase versions of the input.

I then concluded that I should use a better parameter initialization than the default. The goal was to start from a network that just mimics the identity, since for a strongly autocorrelated signal that should be a good first estimate of the next step (x(t+1) ~ x(t)), and to sprinkle a bit of noise on top of it.

import theano, numpy, lasagne
from theano import tensor as T
from lasagne.layers.recurrent import LSTMLayer, InputLayer, Gate
from lasagne.layers import DropoutLayer
from lasagne.nonlinearities import sigmoid, tanh, leaky_rectify
from lasagne.layers import get_output
from lasagne.init import GlorotNormal, Normal, Constant

floatX = 'float32'

# create an LSTM that ~ propagates the input from start to finish right off the bat
# should be a good start for a predictive LSTM with high one-step autocorrelation
def create_identity_lstm(input, shape, orig_inp=None, noiselvl=0.01, G=10., mask_input=None):
    inp, out = shape
    # orig_inp is used to limit the number of units that are actually used to pass the input information from one layer to the other - the rest of the units should produce ~ 0 activation. 
    if orig_inp is None:
        orig_inp = inp
    # input gate
    inputgate = Gate(
                 W_in=GlorotNormal(noiselvl),
                 W_hid=GlorotNormal(noiselvl),
                 W_cell=Normal(noiselvl),
                 b=Constant(0.),
                 nonlinearity=sigmoid
                 )
    # forget gate
    forgetgate = Gate(
                 W_in=GlorotNormal(noiselvl),
                 W_hid=GlorotNormal(noiselvl),
                 W_cell=Normal(noiselvl),
                 b=Constant(0.),
                 nonlinearity=sigmoid
                 )
    # cell gate
    cell = Gate(
                 W_in=GlorotNormal(noiselvl),
                 W_hid=GlorotNormal(noiselvl),
                 W_cell=None,
                 b=Constant(0.),
                 nonlinearity=leaky_rectify
                 )
    # output gate
    outputgate = Gate(
                 W_in=GlorotNormal(noiselvl),
                 W_hid=GlorotNormal(noiselvl),
                 W_cell=Normal(noiselvl),
                 b=Constant(0.),
                 nonlinearity=sigmoid
                 )
    lstm = LSTMLayer(input, out, ingate=inputgate, forgetgate=forgetgate, cell=cell, outgate=outputgate, nonlinearity=leaky_rectify, mask_input=mask_input)
    # change matrices and biases
    # ingate - should return ~1 (matrices = 0, big bias)
    b_i = lstm.b_ingate.get_value()
    b_i[:orig_inp] += G
    lstm.b_ingate.set_value(b_i)
    # forgetgate - should return 0 (matrices = 0, big negative bias)
    b_f = lstm.b_forgetgate.get_value()
    b_f[:orig_inp] -= G
    b_f[orig_inp:] += G # to help learning future features, I preserve a large bias on "unused" units to help it remember stuff 
    lstm.b_forgetgate.set_value(b_f)
    # cell - should return x(t) (W_xc = identity, rest is 0)
    W_xc = lstm.W_in_to_cell.get_value()
    for i in range(orig_inp):
        W_xc[i, i] += 1.
    lstm.W_in_to_cell.set_value(W_xc)
    # outgate - should return 1 (same as ingate)
    b_o = lstm.b_outgate.get_value()
    b_o[:orig_inp] += G
    lstm.b_outgate.set_value(b_o)
    # done
    return lstm

I then use this function to build the following network:

# layers
#input + dropout
input = InputLayer((None, None, 7), name='input')
mask = InputLayer((None, None), name='mask')
drop1 = DropoutLayer(input, p=0.33)
#lstm1 + dropout
lstm1 = create_identity_lstm(drop1, (7, 1024), mask_input=mask)
drop2 = DropoutLayer(lstm1, p=0.33)
#lstm2 + dropout
lstm2 = create_identity_lstm(drop2, (1024, 128), orig_inp=7, mask_input=mask)
drop3 = DropoutLayer(lstm2, p=0.33)    
#lstm3
lstm3 = create_identity_lstm(drop3, (128, 7), orig_inp=7, mask_input=mask)

# symbolic variables and prediction
x = input.input_var
ma = mask.input_var
ma_reshape = ma.dimshuffle((0,1,'x'))
yhat = get_output(lstm3, deterministic=False)
yhat_det = get_output(lstm3, deterministic=True)
y = T.ftensor3('y')
predict = theano.function([x, ma], yhat_det)

The problem is that, even without any training, this network produces garbage values and sometimes a bunch of NaNs, right from the very first LSTM layer:

X = numpy.random.random((5, 10000, 7)).astype('float32')
Masks = numpy.ones(X.shape[:2], dtype='float32')
hid1 = get_output(lstm1, determistic=True)
get_hid1 = theano.function([x, ma], hid1)
h1 = get_hid1(X, Masks)
print(numpy.isnan(h1).sum(axis=1).sum(axis=1))
    array([6379520, 6367232, 6377472, 6376448, 6378496])
# even the first output value is garbage!
print(h1[:,0,0] - X[:,0,0])
    array([-0.03898358, -0.10118812,  0.34877831, -0.02509735,  0.36689138], dtype=float32)

I don't understand why. I checked each matrix and the values are fine, exactly as I intended. I even recreated each gate's activations and the resulting hidden activations using the actual numpy arrays, and they reproduce the input just fine. What did I do wrong?
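
For illustration, a manual check of this kind could look like the NumPy sketch below. It is an idealized reconstruction, not the original check: it assumes the standard LSTM step equations, drops the noise and peephole terms (all weights zero except the identity block of W_xc), and uses hypothetical helper names (`identity_lstm_step`, `leaky_rectify`) and a small layer of 16 units instead of 1024.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaky_rectify(z, leakiness=0.01):
    # same shape of nonlinearity as a leaky ReLU with slope 0.01 for z <= 0
    return np.where(z > 0, z, leakiness * z)

def identity_lstm_step(x, h_prev, c_prev, n_in, n_units, G=10.0):
    """One LSTM step with the idealized (noise-free) identity initialization.

    Assumption: W_in/W_hid/W_cell of every gate are exactly zero, so each
    gate reduces to sigmoid(bias); only W_xc carries the input.
    """
    W_xc = np.zeros((n_in, n_units))
    W_xc[np.arange(n_in), np.arange(n_in)] = 1.0      # identity block
    b_i = np.zeros(n_units); b_i[:n_in] += G          # input gate ~ 1
    b_f = np.zeros(n_units); b_f[:n_in] -= G          # forget gate ~ 0
    b_f[n_in:] += G                                   # "unused" units remember
    b_o = np.zeros(n_units); b_o[:n_in] += G          # output gate ~ 1
    i, f, o = sigmoid(b_i), sigmoid(b_f), sigmoid(b_o)
    c = f * c_prev + i * leaky_rectify(x.dot(W_xc))
    h = o * leaky_rectify(c)
    return h, c

rng = np.random.RandomState(0)
n_in, n_units = 7, 16
X = rng.random_sample((5, 20, n_in))                  # (batch, time, features)
h = np.zeros((5, n_units))
c = np.zeros((5, n_units))
outputs = []
for t in range(X.shape[1]):
    h, c = identity_lstm_step(X[:, t], h, c, n_in, n_units)
    outputs.append(h)
H = np.stack(outputs, axis=1)                         # (batch, time, units)

# the first n_in units should track the input; the others should stay at 0
max_err = np.abs(H[:, :, :n_in] - X).max()
```

Under these assumptions `max_err` stays small (the gates are sigmoid(±10), so not exactly 0 or 1) and no NaN appears, which is the behaviour expected from the Theano graph as well.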

Did you solve your problem eventually? I'm doing something similar and experiencing the same problem. — user667804, Feb 16 '16 at 13:20
Hey! It's been a while since I worked on that and I don't remember much of what I was able to achieve with it (not much, I guess)... But I posted the same question on the Lasagne Google group and got a bunch of neat answers [here](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/lasagne-users/Dqmj9BT9_rU/5W2VlZmEBAAJ). You might want to give it a shot. — Nathan, Feb 17 '16 at 23:06
Thanks, I saw that, and it turned out that I was using np.ndarray instead of np.zeros in one place (creating NaNs instead of zeros). I found it by inserting a lot of asserts everywhere. But thank you for replying and providing the link! — user667804, Feb 18 '16 at 00:32
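
The pitfall described in the last comment, and the "asserts everywhere" approach, can be sketched roughly as follows. The helper name `assert_finite` is hypothetical, and the corrupted buffer is simulated here, since the contents of an uninitialized array are not predictable.

```python
import numpy as np

def assert_finite(name, arr):
    """Fail with the array name and first bad index if arr holds NaN or inf."""
    arr = np.asarray(arr)
    bad = ~np.isfinite(arr)
    if bad.any():
        first = tuple(int(k) for k in np.argwhere(bad)[0])
        raise AssertionError("%s has %d non-finite values, first at index %s"
                             % (name, int(bad.sum()), first))

# np.zeros guarantees an all-zero buffer
safe = np.zeros((3, 4), dtype='float32')
assert_finite('safe', safe)

# np.ndarray(shape), like np.empty, only allocates memory: the contents are
# whatever bytes were there before, which may decode to NaN or huge values.
# Filling it explicitly makes it well-defined.
unsafe = np.ndarray((3, 4), dtype='float32')
unsafe[...] = 0.0
assert_finite('unsafe', unsafe)

# simulated corruption, standing in for an uninitialized entry
corrupted = safe.copy()
corrupted[1, 2] = np.nan
try:
    assert_finite('corrupted', corrupted)
    caught = False
except AssertionError as err:
    caught = True
    message = str(err)
```

Calling such a helper on every parameter array and intermediate activation narrows down where the first non-finite value enters the computation.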
