
I was looking for an implementation of an LSTM cell in Pytorch that I could extend, and I found an implementation of it in the accepted answer here. I will post it here because I'd like to refer to it. There are quite a few implementation details that I do not understand, and I was wondering if someone could clarify.

import math
import torch as th
import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)

        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)
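
For context, here is a minimal sketch of how I would call it (the single time step and the [1, batch_size, ...] hidden-state layout are my assumption, based on the view calls in forward):

batch_size, input_size, hidden_size = 2, 5, 10
cell = LSTM(input_size, hidden_size)

# One time step; the leading 1 matches the view(1, ...) calls in forward
x = th.randn(1, batch_size, input_size)
h0 = th.zeros(1, batch_size, hidden_size)
c0 = th.zeros(1, batch_size, hidden_size)

out, (h1, c1) = cell(x, (h0, c0))
out.shape  # => torch.Size([1, 2, 10])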

1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the __init__ method)?

2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?

3- Why do we use view for h, c, and x in the forward method?

4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?

5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws in these equations:

i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t * c_{t-1} + i_t * g_t
h_t = o_t * tanh(c_t)

An Ignorant Wanderer

1 Answer


1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the __init__ method)?

In the equations you have included, the input x and the hidden state h are each used in four calculations, each of which is a matrix multiplication with its own weight matrix. It makes no difference whether you do four separate matrix multiplications, or concatenate the weights, do one bigger matrix multiplication, and separate the results afterwards.

import torch

input_size = 5
hidden_size = 10

input = torch.randn((2, input_size))

# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))

# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)

# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))

# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]

# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True

By setting the output size of the linear layer to 4 * hidden_size, the layer contains four weight matrices with output size hidden_size each, so only one layer is needed instead of four. There is not really an advantage to doing this, except maybe a minor performance improvement, mostly for smaller inputs that would not fully exhaust the parallelisation capabilities if done individually.
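
As a sketch of what that means in terms of layers (the layer names here are just for illustration), a single nn.Linear with output size 4 * hidden_size behaves like four separate nn.Linear layers whose outputs are concatenated:

import torch
import torch.nn as nn

input_size, hidden_size, batch_size = 5, 10, 2
combined = nn.Linear(input_size, 4 * hidden_size)

# Build four separate layers that reuse the combined layer's weights
separate = []
for w, b in zip(torch.chunk(combined.weight, 4, dim=0),
                torch.chunk(combined.bias, 4, dim=0)):
    layer = nn.Linear(input_size, hidden_size)
    with torch.no_grad():
        layer.weight.copy_(w)
        layer.bias.copy_(b)
    separate.append(layer)

x = torch.randn(batch_size, input_size)
out_combined = combined(x)
out_separate = torch.cat([layer(x) for layer in separate], dim=1)
torch.allclose(out_combined, out_separate)  # => True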

4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?

That's where the outputs are separated to correspond to the outputs of the four individual calculations. The pre-activation output is the concatenation [i_t; f_t; o_t; g_t], i.e. before sigmoid and tanh have been applied.

You can get the same separation by splitting the output into four chunks with torch.chunk:

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)

But after the separation you would have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
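
Put together, a sketch of the same gate computation with torch.chunk (using stand-in tensors for preact and the previous cell state c):

import torch

batch_size, hidden_size = 2, 10
preact = torch.randn(batch_size, 4 * hidden_size)  # stand-in for i2h(x) + h2h(h)
c = torch.randn(batch_size, hidden_size)           # stand-in for the previous cell state

# Split the pre-activations into the four gate inputs, then apply the activations
i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)
i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
g_t = torch.tanh(g_t)

c_t = f_t * c + i_t * g_t
h_t = o_t * torch.tanh(c_t)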

5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:

The W parameters are the weights of the linear layer self.i2h, and the U parameters are the weights of the linear layer self.h2h, but concatenated along the output dimension.

W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
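
For example, the shapes of the chunks line up with the W and U matrices from the equations (a quick check, assuming an instance of the LSTM class from the question):

import torch

lstm = LSTM(input_size=5, hidden_size=10)  # the class from the question

W_i, W_f, W_o, W_c = torch.chunk(lstm.i2h.weight, 4, dim=0)
U_i, U_f, U_o, U_c = torch.chunk(lstm.h2h.weight, 4, dim=0)

W_i.shape  # => torch.Size([10, 5]), multiplies the input x
U_i.shape  # => torch.Size([10, 10]), multiplies the hidden state h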

3- Why do we use view for h, c, and x in the forward method?

Based on h_t = h_t.view(1, h_t.size(0), -1) towards the end, the hidden states have the size [1, batch_size, hidden_size]. The call h = h.view(h.size(1), -1) gets rid of the leading singleton dimension, giving size [batch_size, hidden_size]. The same could be achieved with h.squeeze(0).
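
A small sketch of that equivalence:

import torch

h = torch.randn(1, 2, 10)  # [1, batch_size, hidden_size]

a = h.view(h.size(1), -1)  # => size [2, 10]
b = h.squeeze(0)           # => size [2, 10]
torch.equal(a, b)          # => True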

2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?

Parameter initialisation can have a big impact on the model's learning capability. The general rule for the initialisation is to have values close to zero without being too small. A common initialisation is to draw from a normal distribution with mean 0 and variance of 1 / n, where n is the number of neurons, which in turn means a standard deviation of 1 / sqrt(n).

In this case a uniform distribution is used instead of a normal distribution, but the general idea is the same: the bound is determined by the number of neurons while avoiding values that are too small. If the bound were 1 / n, the values would get very small, so using 1 / sqrt(n) is more appropriate, e.g. for 256 neurons: 1 / 256 = 0.0039, whereas 1 / sqrt(256) = 0.0625.
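
The same initialisation could also be written with torch.nn.init (a sketch, not part of the original code):

import math
import torch.nn as nn

hidden_size = 256
layer = nn.Linear(hidden_size, 4 * hidden_size)

# Same bound as reset_parameters: 1 / sqrt(hidden_size), not 1 / hidden_size
std = 1.0 / math.sqrt(hidden_size)  # 0.0625 for 256 neurons
for w in layer.parameters():
    nn.init.uniform_(w, -std, std)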

Initializing neural networks provides some explanations of different initialisations with interactive visualisations.

Michael Jungo
  • Thanks a lot this was very helpful. I'm going to accept your answer but just one more clarification. You say that the hidden state has size (1, batch_size, hidden_size). Why is there a 1 in the first dimension in the first place? – An Ignorant Wanderer May 30 '20 at 22:42
  • 1
    I can only speculate, but I'd say that's because it follows [`nn.LSTM`](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), where the hidden state has size *(num_layers \* num_directions, batch_size, hidden_size)* and a cell is only 1 direction and 1 layer, therefore it would be one. But in contrast [`nn.LSTMCell`](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell) omits that dimensions completely and has size *(batch_size, hidden_size)*, which makes more sense to me. – Michael Jungo May 30 '20 at 23:03
  • Thanks that makes sense. Actually two more questions. From what I know, an LSTM cell takes as input both the hidden state and the cell state. In the forward method, the only input is "hidden". They then set h and c to be this "hidden". That seems odd that they're not differentiating between hidden state and cell state... – An Ignorant Wanderer May 30 '20 at 23:45
  • My second question is about the return statement. Why did they return h_t, (h_t, c_t)? Why return h_t twice? – An Ignorant Wanderer May 30 '20 at 23:47
  • 1
    The hidden state and cell state are separate, but kept in a tuple, supposedly because they are complementary to each other and to keep the API similar to regular RNNs (no cell state), so unless you unpack the LSTM's hidden states, various RNNs can be interchanged seamlessly. `h_t` is the output as well as the hidden state, and again `nn.LSTMCell` omits it entirely. I recommend reading [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) to get a better understanding of LSTMs. – Michael Jungo May 31 '20 at 00:12