I wouldn't say it does automatic "unfolding" - rather, Theano builds a symbolic graph that keeps track of which variables are connected, and gradients and updates can be propagated along that chain. If this is what you mean by unfolding, then maybe we are talking about the same thing.
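As a tiny illustration of what I mean by connectedness (my own toy example, not from the tutorial): cost below is connected to the shared variable W through the graph, so TT.grad can walk that chain and theano.function can apply the resulting update.

import numpy as np
import theano
import theano.tensor as TT

x = TT.vector('x')
W = theano.shared(np.ones((3, 3), dtype=theano.config.floatX), name='W')
cost = (TT.dot(x, W) ** 2).sum()   # cost is connected to W via the dot
gW = TT.grad(cost, W)              # Theano follows the chain cost -> dot -> W
f = theano.function([x], cost, updates=[(W, W - 0.01 * gW)])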
I am stepping through this as well, but using Razvan Pascanu's rnn.py code (from this thread) for reference. It seems much more straightforward for a learning example.
You might gain some value from visualizing/drawing the graphs from the tutorial. There is also a set of slides online with a simple drawing that shows the diagram for a 1-layer "unfolding" of an RNN, which you discuss in your post.
Specifically, look at the step function:
def step(u_t, h_tm1, W, W_in, W_out):
    # TT is theano.tensor; h_tm1 is the hidden state from the previous time step
    h_t = TT.tanh(TT.dot(u_t, W_in) + TT.dot(h_tm1, W))
    # the output is a linear readout of the new hidden state
    y_t = TT.dot(h_t, W_out)
    return h_t, y_t
This function represents the "simple recurrent net" shown in these slides, pg 10. When you do the updates, you simply take the gradient of the cost w.r.t. W, W_in, and W_out, respectively (remember that y is connected to all three via the step function! This is how the gradient magic works).
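To make that concrete, here is a rough sketch of how step plugs into theano.scan and how those gradients become updates. The parameter shapes, initialization, and squared-error cost are my own assumptions for the sake of a runnable example, not exactly what rnn.py does:

import numpy as np
import theano
import theano.tensor as TT

n_in, n_h, n_out = 2, 10, 1
rng = np.random.RandomState(0)
W = theano.shared(rng.uniform(-0.1, 0.1, (n_h, n_h)).astype(theano.config.floatX), name='W')
W_in = theano.shared(rng.uniform(-0.1, 0.1, (n_in, n_h)).astype(theano.config.floatX), name='W_in')
W_out = theano.shared(rng.uniform(-0.1, 0.1, (n_h, n_out)).astype(theano.config.floatX), name='W_out')

u = TT.matrix('u')      # input sequence, shape (time, n_in)
t = TT.matrix('t')      # target sequence, shape (time, n_out)
h0 = TT.vector('h0')    # initial hidden state
lr = TT.scalar('lr')

# scan applies step at every time step, feeding h_t back in as h_tm1
[h, y], _ = theano.scan(step,
                        sequences=u,
                        outputs_info=[h0, None],
                        non_sequences=[W, W_in, W_out])

cost = ((y - t) ** 2).sum()
# the gradients flow back through scan to the three weight matrices
gW, gW_in, gW_out = TT.grad(cost, [W, W_in, W_out])

train = theano.function([u, t, h0, lr], cost,
                        updates=[(W, W - lr * gW),
                                 (W_in, W_in - lr * gW_in),
                                 (W_out, W_out - lr * gW_out)])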
If you had multiple W layers (or indices into one big W, as I believe gwtaylor is doing), then that would create multiple layers of "unfolding". From what I understand, this network only looks 1 step backward in time. If it helps, theanets also has an RNN implementation in Theano.
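For example, a two-layer version of step might look something like this (the names and wiring are purely hypothetical, just to illustrate the stacking idea):

def step2(u_t, h1_tm1, h2_tm1, W1, W_in, W2, W_12, W_out):
    # first hidden layer: driven by the input and its own previous state
    h1_t = TT.tanh(TT.dot(u_t, W_in) + TT.dot(h1_tm1, W1))
    # second hidden layer: driven by the first layer and its own previous state
    h2_t = TT.tanh(TT.dot(h1_t, W_12) + TT.dot(h2_tm1, W2))
    y_t = TT.dot(h2_t, W_out)
    return h1_t, h2_t, y_t

In the scan call you would then use outputs_info=[h1_0, h2_0, None] so that both hidden states get fed back.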
As an additional note, training RNNs with BPTT is hard. Ilya Sutskever's dissertation discusses this at great length - if you can, try to tie into a Hessian-Free optimizer; there is also a reference RNN implementation here. Theanets also does this, and may be a good reference.