3

I'm trying to understand and implement a multi-layer LSTM. The problem is that I don't know how the layers connect. I have two ideas in mind:

  1. At each timestep, the hidden state H of the first LSTM will become the input of the second LSTM.

  2. At each timestep, the hidden state H of the first LSTM will become the initial value for the hidden state of the second LSTM, and the input of the first LSTM will become the input for the second LSTM.

Please help!

desertnaut
Khoa Ngo

4 Answers

3

TL;DR: Each LSTM cell at time t and layer l has an input x(t) and a hidden state h(l,t). In the first layer, the input is the actual sequence input x(t) together with the previous hidden state h(1,t-1); in each subsequent layer, the input is the hidden state of the corresponding cell in the previous layer, h(l-1,t).

From https://arxiv.org/pdf/1710.02254.pdf:

To increase the capacity of GRU networks (Hermans and Schrauwen 2013), recurrent layers can be stacked on top of each other. Since GRU does not have two output states, the same output hidden state h'2 is passed to the next vertical layer. In other words, the h1 of the next layer will be equal to h'2. This forces GRU to learn transformations that are useful along depth as well as time.
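To make that convention concrete, here is a minimal sketch (assuming PyTorch; the sizes are arbitrary) of two manually stacked LSTM layers, where the hidden states of layer 1 at every timestep become the input sequence of layer 2:

```python
import torch
import torch.nn as nn

# Two manually stacked LSTM layers: the hidden state h(1, t) of
# layer 1 is the input of layer 2 at the same timestep t.
input_size, hidden_size = 8, 16
lstm1 = nn.LSTM(input_size, hidden_size, batch_first=True)
lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True)

x = torch.randn(4, 10, input_size)  # (batch, time, features)
h1, _ = lstm1(x)                    # h1 holds h(1, t) for all t
h2, _ = lstm2(h1)                   # layer 2 consumes h(1, t) as its input
```

Note that layer 2 keeps its own recurrent hidden state; only its input comes from layer 1.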

Ido Cohn
  • 1,685
  • 3
  • 21
  • 28
2

I'm drawing on colah's blog post, cut down to just the part you need to understand.

[Figure: a chain of repeating LSTM modules from colah's blog, each containing four neural network layers]

As you can see in the image above, LSTMs have this chain-like structure, and each repeating module has four neural network layers.

The value we pass to the next timestep and the value we pass up to the next layer are basically the same thing: the hidden state, which is the desired output. This output is based on our cell state, but is a filtered version of it. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to pass.

We also pass the previous cell state information (the top arrow into the next cell) along to the next timestep, and a sigmoid layer (the forget gate) decides how much of that information to keep, with the help of the new input and the hidden state from the previous timestep.
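Putting the two paragraphs above together, a single LSTM step looks roughly like this (plain NumPy; the weight names W_f, W_i, W_c, W_o are illustrative, following colah's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One illustrative LSTM step; weights operate on [h_prev, x] concatenated.
def lstm_step(x, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much old cell state to keep
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new candidate to add
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell state
    c = f * c_prev + i * c_tilde      # new cell state (the top arrow)
    o = sigmoid(W_o @ z + b_o)        # output gate: which parts of c to expose
    h = o * np.tanh(c)                # filtered cell state
    return h, c
```

The last line is the key to your question: the same h serves both as the recurrent state for the next timestep and as the signal passed up to the next layer.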

Hope this helps.

Tushar Gupta
  • I think you may be misunderstanding: the three boxes in the image are three timesteps of one LSTM cell, not three stacked LSTM cells. – Khoa Ngo Nov 24 '17 at 12:09
  • Yeah, but I want to stack multiple LSTM cells :( – Khoa Ngo Nov 25 '17 at 03:13
  • @KhoaNgo the stacking works along the other axis: imagine all the boxes placed on top of each other, where the output of one box is the input of the next box, each box with a hidden state of its own. – Mina Nov 08 '19 at 19:13
0

In PyTorch, the multilayer LSTM implementation works this way: the hidden state of the previous layer becomes the input to the next layer. So your first assumption is correct.
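For example, nn.LSTM's num_layers argument builds exactly this stack (the sizes here are arbitrary):

```python
import torch.nn as nn

# A 2-layer stacked LSTM in one call: at every timestep, layer 2
# receives the hidden state of layer 1 as its input.
stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
```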

Mina
-1

There's no definite answer. It depends on your problem and you should try different things.

The simplest thing you can do is to pipe the output from the first LSTM (not the hidden state) as the input to the second layer of LSTM (instead of applying some loss to it). That should work in most cases.

You can try piping the hidden state as well, but I haven't seen that very often.

You can also try other combinations, as sketched below. Say, for the second layer, you feed in both the output of the first layer and the original input. Or you connect the second layer to the first layer's output from both the current unit and the previous one.
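A rough sketch of the first of those combinations (assuming PyTorch; the sizes and the concatenation scheme are just one way to do it):

```python
import torch
import torch.nn as nn

# Skip-connection variant: layer 2 sees both the first layer's
# outputs and the original inputs, concatenated along features.
input_size, hidden_size = 8, 16
lstm1 = nn.LSTM(input_size, hidden_size, batch_first=True)
lstm2 = nn.LSTM(hidden_size + input_size, hidden_size, batch_first=True)

x = torch.randn(4, 10, input_size)            # (batch, time, features)
out1, _ = lstm1(x)
out2, _ = lstm2(torch.cat([out1, x], dim=-1))  # layer 2 gets output + input
```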

It all depends on your problem and you need to experiment to see what works for you.

Sorin
  • What actually is the output of an LSTM? I read some posts about LSTMs and saw that they produce the memory Ct and the hidden state Ht. – Khoa Ngo Nov 24 '17 at 12:11
  • @KhoaNgo It depends on the model. Ct is usually the nodes you read to output the step prediction (the next character or next token, if you train it that way), while the hidden state usually encodes the state so far (i.e. the entire word/sentence). In the end you still have some number of neurons that get trained, and it's hard to say exactly what they mean (unless you train them to have a certain meaning). – Sorin Nov 30 '17 at 12:52