Based on my understanding so far, after training an RNN/LSTM model for a sequence classification task, I can make predictions in one of the following two ways:

  1. Take the last state and make a prediction using a softmax layer
  2. Take the states at all time steps, make a prediction at each time step, sum the predictions, and take the class with the maximum summed score

In general, is there any reason to choose one over the other? Or is this application-dependent? Also, if I decide to use the second strategy, should I use a different softmax layer for each time step or one softmax layer shared across all time steps?
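To make the two options concrete, here is a minimal PyTorch sketch (all shapes, sizes, and variable names are made up for illustration; strategy 2 shares one softmax layer across time steps):

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration only.
batch, steps, feat, hidden, n_classes = 8, 20, 32, 64, 5
x = torch.randn(batch, steps, feat)

rnn = nn.LSTM(feat, hidden, batch_first=True)
classifier = nn.Linear(hidden, n_classes)  # produces logits; softmax applied below

outputs, (h_n, _) = rnn(x)  # outputs: (batch, steps, hidden)

# Strategy 1: predict from the last state only.
pred_last = classifier(h_n[-1]).softmax(dim=-1).argmax(dim=-1)

# Strategy 2: predict at every time step with the shared layer,
# sum the per-step distributions, then take the maximum.
per_step = classifier(outputs).softmax(dim=-1)  # (batch, steps, n_classes)
pred_sum = per_step.sum(dim=1).argmax(dim=-1)
```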

amin__

1 Answer


I have never seen any network that implements the second approach. The most obvious reason is that all states except for the last one haven't seen the whole sequence.

Take, for example, review sentiment classification. A review can start with a few positive aspects, followed by a "but" and a list of drawbacks. All RNN cells before the "but" are going to be biased, and their states won't reflect the true label. Does it matter how many of them output the positive class and how confident they are? The last cell's output would be a better predictor anyway, so I don't see a reason to take the previous ones into account.

If the sequential aspect of the data is not important in a particular problem, then an RNN doesn't seem like a good approach in general. Otherwise, you are better off using the last state.


There is, however, one exception: sequence-to-sequence models with an attention mechanism (see, for instance, this question). But that is a different setting, because the decoder predicts a new token at each step, so it can benefit from looking at earlier encoder states. Besides, it takes the final hidden state into account as well.
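For reference, a toy dot-product attention sketch (made-up shapes; not taken from the linked question) showing how a decoder state weights all encoder states, which is why earlier states matter in that setting:

```python
import torch

# Made-up shapes: 8 sequences, 20 encoder steps, hidden size 64.
encoder_states = torch.randn(8, 20, 64)
decoder_state = torch.randn(8, 64)

# Dot-product scores of the decoder state against every encoder state.
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
weights = scores.softmax(dim=-1)  # (8, 20) attention weights over encoder steps
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # (8, 64)
```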

Maxim
  • Ok. Thanks. That makes sense. How about the 2-layer RNN case: taking the final layer's final state vs. the maximum of the first layer's final state added to the final layer's state (I mean the prediction score computed from each state)? – amin__ Feb 05 '18 at 23:27
  • Deeper RNNs aren't much different, actually. The early cells are still not connected to the data at later time steps. The first layer can be seen as a feature extractor, producing a new sequence for analysis. BTW, if you're interested, you might take a look at bidirectional RNNs. – Maxim Feb 05 '18 at 23:31
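For completeness, a minimal sketch of the bidirectional idea mentioned in the last comment (made-up shapes, assuming PyTorch): each direction's final state has seen the whole sequence from one end, and the two are concatenated before the classifier.

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration only.
feat, hidden, n_classes = 32, 64, 5
x = torch.randn(8, 20, feat)

birnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden, n_classes)

_, (h_n, _) = birnn(x)  # h_n: (2, batch, hidden), one final state per direction
logits = classifier(torch.cat([h_n[0], h_n[1]], dim=-1))
```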