
After training an RNN, does it make sense to save the final state so that it becomes the initial state for testing?

I am using:

from tensorflow.contrib import rnn  # needed for rnn.MultiRNNCell / rnn.BasicLSTMCell (TF 1.x)
stacked_lstm = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden, state_is_tuple=True) for _ in range(number_of_layers)], state_is_tuple=True)

Arraval

1 Answer


The state has a very specific meaning and purpose. This isn't a question of whether it's "advisable" or not; there is a right and a wrong answer here, and it depends on your data.

Consider each time step in your sequence of data. At the first time step your state should be initialized to all zeros. This value has a specific meaning: it tells the network that this is the beginning of your sequence.

At each time step the RNN computes a new state. The MultiRNNCell implementation in TensorFlow hides this from you, but internally a new hidden state is computed at each time step and passed forward.

The initial state at the 2nd time step is the state output by the 1st time step, and so on.

So the answer to your question is yes, but only if the next batch continues in time from the previous batch. Let me explain with two examples: one where you do carry the state forward, and one where you don't.
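To make that concrete, here is a minimal TF 1.x sketch of the graph side. The placeholder shape and the illustrative values for batch_size, seq_len, n_features, n_hidden and number_of_layers are my assumptions, not taken from the question:

    import tensorflow as tf
    from tensorflow.contrib import rnn

    batch_size, seq_len, n_features = 1, 100, 10      # illustrative values, not from the question
    n_hidden, number_of_layers = 128, 2

    # assumed input shape: [batch_size, seq_len, n_features]
    inputs = tf.placeholder(tf.float32, [batch_size, seq_len, n_features])

    stacked_lstm = rnn.MultiRNNCell(
        [rnn.BasicLSTMCell(n_hidden, state_is_tuple=True) for _ in range(number_of_layers)],
        state_is_tuple=True)

    # an all-zero state marks the beginning of a sequence
    init_state = stacked_lstm.zero_state(batch_size, tf.float32)

    # dynamic_rnn steps through the sequence internally and also returns
    # the state left after the last time step
    outputs, final_state = tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=init_state)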

Example 1: let's say you are training a character RNN, a common tutorial example where your input is each character in the works of Shakespeare. There are millions of characters in this sequence, and you can't train on a sequence that long. So you break your sequence into segments of 100 (if you don't know to do otherwise, limit your sequences to roughly 100 time steps). In this example, each training step is a sequence of 100 characters and is a continuation of the last 100 characters. So you must carry the state forward to the next training step, as sketched below.
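A rough sketch of that training loop, reusing the inputs, init_state and final_state tensors from the sketch above; train_op and consecutive_segments are assumed to exist. TF 1.x lets you feed a nested state tuple directly through feed_dict, which is the key trick here:

    state = sess.run(init_state)                  # zeros: the start of the whole corpus
    for segment in consecutive_segments:          # each segment: the next 100 characters
        state, _ = sess.run(
            [final_state, train_op],
            feed_dict={inputs: segment, init_state: state})  # carry the state forward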

Example 2: a case where this isn't used would be training an RNN to recognize MNIST handwritten digits. In this case you split your image into 28 rows of 28 pixels, and each training example has only 28 time steps, one per row in the image. Each training iteration starts at the beginning of the sequence for that image and trains fully to the end of the sequence for that image. You would not carry the hidden state forward in this case; your hidden state must start from zeros to tell the system that this is the beginning of a new image sequence, not the continuation of the last image you trained on.
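For contrast, the MNIST case under the same assumed graph never feeds a state back in; init_state was built with zero_state, so every batch simply starts from zeros (mnist_batches, targets and train_op are again assumptions for illustration):

    for images, labels in mnist_batches:          # each image: 28 rows of 28 pixels
        # nothing is carried over between iterations; init_state evaluates to zeros
        sess.run(train_op, feed_dict={inputs: images, targets: labels})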

I hope those two examples illustrate the important difference. Know that if your sequence lengths are very long (say over ~100 time steps) you need to break them up and think through how to carry the state forward appropriately; you can't effectively train on arbitrarily long sequences. If your sequence lengths are under this rough threshold, you don't need to worry about this detail and can always initialize your state to zeros.
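The splitting itself can be as simple as slicing the long sequence into consecutive, non-overlapping blocks (plain Python, with series standing in for your full sequence):

    seq_len = 100
    segments = [series[i:i + seq_len]
                for i in range(0, len(series) - seq_len + 1, seq_len)]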

Also know that even though you only train on, say, 100 time steps at a time, the RNN can still learn patterns that operate over longer sequences. Karpathy's fabulous post "The Unreasonable Effectiveness of Recurrent Neural Networks" demonstrates this beautifully: those character-level RNNs can keep track of important details, like whether a quote is open or not, over many hundreds of characters, far more than were ever trained on in one batch, specifically because the hidden state was carried forward in the appropriate manner.

David Parks
  • Thanks very much for clarifying this conceptual question. I'm trying the RNN as a day-ahead energy price predictor. The inputs are weather conditions, previous prices, renewables output, time of day, etc. My training set is the hourly day-ahead price for a year. The test set is the subsequent day-ahead price events. Hopefully I'm not wrong in using an RNN, and based on your explanation I think it's necessary to pass the output state of the training as the initial state for testing. I'm still wondering about the sequence length, because in my case it is ~8000. – Arraval Nov 21 '17 at 21:00
  • In training you do need to split up the sequence length of 8000 into blocks of ~100; you'll have a vanishing gradient problem at sequence lengths of 8000. For your test set I would simplify life and initialize the state with zeros at the beginning of the sequence. You can probably compute all 8000 time steps at once in your test set, so there's no need to carry the hidden state forward, but if you can't, then you need to carry the state forward as you work through the sequence. – David Parks Nov 21 '17 at 21:06
  • If you wanted to get fancy you could throw away your first few timesteps from your test data and use them to pre-compute the hidden state. Given that you aren't giving the network any prior knowledge for those first few time steps you might consider them unreasonable to test against. But honestly, that's probably more nitpicky than is really necessary and I would probably ignore it. What you would **not** do is carry the state from training into test, that would make no sense. The state is only ever tied to its specific sequence. And you have a unique state per sequence you're feeding the net. – David Parks Nov 21 '17 at 21:07
  • If I'm understanding what you are saying, then by block size of 100 you are referring to the batch size? What I've done is to train the RNN with one time step at a time, i.e. my input consists of a single time slot, not the 8000 that I've got in the training set (I have a loop that feeds one at a time). I know this sounds like a very naive approach, but I'm still trying to understand RNNs and TensorFlow. I thought it would be easier for me to interpret the results this way. Would it be better to feed the RNN with 100 time slots? – Arraval Nov 21 '17 at 21:15
  • I had to think about that for a moment. You shouldn't train with just 1 time step: you end up getting a gradient only from the loss function, and in that case you don't get a second gradient passed backward in time from the subsequent time step, if I'm thinking it out correctly. You want to train over multiple time steps in each update, the more the better until you encounter the vanishing gradient, for which 100 seems reasonable (see: http://imgur.com/gallery/vaNahKE). To simplify life I would take a batch size of 1 and train on a set of 100 time steps at a time. – David Parks Nov 21 '17 at 23:29
  • Most people will train with a decent batch size and ~100 time steps. But in those cases you'll also have to deal with variable-length sequences (https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html), which you can safely sidestep for simplicity by using a batch size of 1 to get things off the ground; that should only cost you some speed, not much in accuracy. – David Parks Nov 21 '17 at 23:31