
I am trying to add an attention mechanism to the stacked LSTM implementation at https://github.com/salesforce/awd-lstm-lm.

All the examples I have found online use an encoder-decoder architecture, which I do not want to use (do I have to use one for an attention mechanism?).

Basically, I have used https://webcache.googleusercontent.com/search?q=cache:81Q7u36DRPIJ:https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py+&cd=2&hl=en&ct=clnk&gl=uk

def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)

    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1,2)
    weights = torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)#.squeeze(2)).unsqueeze(2)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed

def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])

            raw_output = self.attention(raw_output, new_h[0])

            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)

        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)

    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2,1)
    output = output.transpose(2,1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden

(figure omitted)

This model trains, but the loss is quite high compared to the model without attention.

Boris Mocialov
  • can you explain your need briefly rather than showing your code? your question is a bit confusing. if you can tell what is your idea of adding attention mechanism on top of the stacked RNN, I will be able to help you. also, what do you mean by this implementation of attention is missing one dimension? why you need that additional dimension? btw, you don't need to use encoder-decoder architecture to use attention. if you understand what attention means, you can use it anywhere. – Wasi Ahmad Mar 07 '18 at 02:19
  • @WasiAhmad I need to modify the code from the link in the question (I use it for language modelling) to include attention mechanism so that I can compare the quality of trained models. In the original code, forward returns 5x5x831 tensor (batchesXlengthXdictionary). If I use the attention from my question, I get 5x831 dimension tensor, which is missing one dimension. I wondered if I could modify the attention function to bring the 3rd dimension back, but I think patapouf_ai instead suggests applying attention for every word in `emb` tensor – Boris Mocialov Mar 08 '18 at 15:08
  • @WasiAhmad I have changed the forward function to feed word-by-word into the attention, but my results are worse than that of the model without the attention. I have edited my question – Boris Mocialov Mar 09 '18 at 12:49

2 Answers


I understood your question but it is a bit tough to follow your code and find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.

Please note that a particular trick/mechanism is only useful if you use it in the right way. I am not sure the way you are trying to use the attention mechanism is correct, so don't expect good results just because you added attention to your model. You should ask yourself why an attention mechanism would benefit your desired task.


You didn't clearly state which task you are targeting. Since you have pointed to a repo that contains language-modeling code, I am guessing the task is: given a sequence of tokens, predict the next token.

One possible problem I can see in your code: in the for item in emb: loop, you always use the embeddings as input to every LSTM layer, so having a stacked LSTM doesn't make sense to me.


Now, let me first answer your question and then show step by step how you can build your desired NN architecture.

Do I need to use an encoder-decoder architecture to use an attention mechanism?

The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example, machine translation. The answer to your question is no, you are not required to use any specific neural network architecture to use an attention mechanism.


The structure you presented in the figure is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, I will try to guide you toward a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.

Let's say we have an input of shape 16 x 10, where 16 is batch_size and 10 is seq_len; that is, 16 sentences in a mini-batch, each of length 10.

import numpy as np
import torch
from torch.autograd import Variable

batch_size, vocab_size = 16, 100
mat = np.random.randint(vocab_size, size=(batch_size, 10))  # 16 x 10 matrix of random token ids
input_var = Variable(torch.from_numpy(mat))

Here, 100 can be considered the vocabulary size. It is important to note that throughout the example I am providing, I assume batch_size is the first dimension of all the tensors/variables involved.

Now, let's embed the input variable.

import torch.nn as nn

embedding = nn.Embedding(100, 50)  # vocab_size = 100, embedding_size = 50
embed = embedding(input_var)       # 16 x 10 x 50

After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.

Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.

rnns = nn.ModuleList()
nlayers, input_size, hidden_size = 2, 50, 100
for i in range(nlayers):
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1, batch_first=True))

Then, we can feed our input to this 2-layer LSTM to get the output.

import torch.nn.functional as F

sent_variable = embed
outputs, hid = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)  # output: batch_size x seq_len x hidden_size
    outputs.append(output)
    hid.append(hidden[0].squeeze(0))         # last hidden state: batch_size x hidden_size
    sent_variable = output

rnn_out = torch.cat(outputs, 2)  # batch_size x seq_len x (nlayers*hidden_size)
hid = torch.cat(hid, 1)          # batch_size x (nlayers*hidden_size)

Now, you can simply use hid to predict the next word, and I would suggest you do that. Here, the shape of hid is batch_size x (num_layers*hidden_size).
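As a minimal sketch of that suggestion (the decoder below is my own addition, just to illustrate the shapes; it reuses the imports and variables from above):

decoder = nn.Linear(nlayers * hidden_size, vocab_size)
logits = decoder(hid)                  # batch_size x vocab_size
log_probs = F.log_softmax(logits, 1)   # log-probabilities, ready for nn.NLLLoss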

But since you want to use attention to compute a soft alignment score between the last hidden state and each of the hidden states produced by the LSTM layers, let's do that.

sent_variable = embed
hid, con = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    sent_variable = output

    hidden = hidden[0].squeeze(0)  # batch_size x hidden_size
    hid.append(hidden)
    weights = torch.bmm(output[:, 0:-1, :], hidden.unsqueeze(2)).squeeze(2)  # batch_size x (seq_len-1)
    soft_weights = F.softmax(weights, 1)  # batch_size x (seq_len-1)
    context = torch.bmm(output[:, 0:-1, :].transpose(1, 2), soft_weights.unsqueeze(2)).squeeze(2)  # batch_size x hidden_size
    con.append(context)

hid, con = torch.cat(hid, 1), torch.cat(con, 1)  # each: batch_size x (nlayers*hidden_size)
combined = torch.cat((hid, con), 1)              # batch_size x (nlayers*hidden_size*2)

Here, we compute a soft alignment score between the last state and the states at every time step. Then we compute a context vector, which is just a linear combination of those hidden states, and concatenate everything to form a single representation.

Please note that I have removed the last hidden state from output (output[:, 0:-1, :]) since you are comparing against that last hidden state itself.

The final combined representation stores the last hidden states and context vectors produced at each layer. You can directly use this representation to predict the next word.

Predicting the next word is straightforward, and the simple linear layer you are using is just fine.


Edit: We can do the following to predict the next word.

decoder = nn.Linear(nlayers * hidden_size * 2, vocab_size)
dec_out = decoder(combined)

Here, the shape of dec_out is batch_size x vocab_size. Now we can compute the negative log-likelihood loss, which will be used for backpropagation later.

Before computing the negative log-likelihood loss, we need to apply log_softmax to the output of the decoder.

dec_out = F.log_softmax(dec_out, 1)  # log-probabilities: batch_size x vocab_size
target = np.random.randint(vocab_size, size=(batch_size))  # one (random) next-token id per example
target = Variable(torch.from_numpy(target))

We also defined the target, which is required to compute the loss; see NLLLoss for details. So now we can compute the loss as follows.

criterion = nn.NLLLoss()
loss = criterion(dec_out, target)
print(loss)

The printed loss value is:

Variable containing:
 4.6278
[torch.FloatTensor of size 1]

Hope the entire explanation helps you!!

Wasi Ahmad
  • Thank you for the extended answer. I am now trying your suggestion and I get the same problem as I was getting initially: my embed has dimensions 5x80x400, while hidden rnn state has dimensions 80x400, where 5 is the amount of words as an input (it is the next word prediction task), 80 is batch size, and 400 is embedding dimension. Therefore, weights = torch.bmm(...) cannot be computed as it requires the same batch sizes. That is why I tried patapouf_ai's suggestion - iterating through embed – Boris Mocialov Mar 12 '18 at 12:25
  • To use bmm, I need to have rnn_output=(batch, seq_len, cell_size) and hidden=(batch, cell_size, 1). In my case, rnn_output batch != hidden batch – Boris Mocialov Mar 12 '18 at 12:29
  • 1
    seems like adding "batch_first=True" to the LSTM definition solves the problem... – Boris Mocialov Mar 12 '18 at 12:35
  • I think I will need some assistance with loss calculation. https://github.com/salesforce/awd-lstm-lm/blob/master/model.py#L94 passes rnn output through Linear decoder and then reshapes the output into (batches, batch_size, vocabulary_size). Both hid and con have dimensions (batches, embedding_dimension) accumulated over all rnn passes. Honestly, I am lost. – Boris Mocialov Mar 12 '18 at 13:39
  • The only way I can see is to create Linear layer of size batches*batch_size*vocabulary_size and then reshape it, does this make sense? – Boris Mocialov Mar 12 '18 at 14:26
  • I have updated my post. Please take a look. And yes, throughout the example I provided, I assumed you are using `batch_first=True` whenever you use an RNN. – Wasi Ahmad Mar 12 '18 at 20:43
  • The problem that I am facing now is the dimensionality mismatch. The loss from the original repo expects 400 targets with 400x8967 predictions. In your code, the decoder outputs vocab_size, which results in 5xvocab_size tensor, but to match the original repo, I need 400xvocab_size tensor. I can make decoder output vocab_size*(400/5) tensor, but then I run out of memory. Alternative is to reduce the batch size, I suppose – Boris Mocialov Mar 13 '18 at 10:36
  • @MocialovBoris please post a different question on the issue you are facing. in one post, you should concentrate only on one major issue. – Wasi Ahmad Mar 13 '18 at 18:25

The whole point of attention is that word order in different languages is different, and so when decoding the 5th word in the target language you might need to pay attention to the 3rd word (or the encoding of the 3rd word) in the source language, because these are the words that correspond to each other. That is why you mostly see attention used with an encoder-decoder structure.

If I understand correctly, you are doing next-word prediction? In that case it might still make sense to use attention, because the next word might depend strongly on a word 4 steps in the past.

So basically what you need is:

rnn: takes an input of shape MB x ninp and a hidden state of shape MB x nhid, and outputs h of shape MB x nhid.

h, next_hidden = rnn(input, hidden)
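For concreteness, one way to get a step function with exactly this interface is to wrap an nn.LSTMCell; the wrapper and the choice of cell are my own, not part of the recipe, and ninp/nhid are the sizes from the question's model:

import torch.nn as nn

cell = nn.LSTMCell(ninp, nhid)

def rnn(input, hidden):
    # input: MB x ninp, hidden: an (h, c) pair, each of shape MB x nhid
    h, c = cell(input, hidden)
    return h, (h, c)  # h: MB x nhid, plus the next hidden state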

attention: takes the sequence of hs and the last state h_last, and decides how important each of them is by giving each a weight w.

w = attention(hs, h_last)

where w is of shape seq_len x MB x 1, hs is of shape seq_len x MB x nhid, and h_last is of shape MB x nhid.
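The attention module itself is left unspecified here; as a minimal sketch, assuming simple dot-product scoring (a learned scoring layer would work just as well):

import torch
import torch.nn.functional as F

def attention(hs, h_last):
    # hs: seq_len x MB x nhid, h_last: MB x nhid
    scores = torch.sum(hs * h_last.unsqueeze(0), dim=2, keepdim=True)  # seq_len x MB x 1
    return F.softmax(scores, dim=0)  # normalize over the sequence dimension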

Now you weight the hs by w:

h_att = torch.sum(w*hs, dim=0)  # shape MB x nhid

Now the point is that you need to do this for every time step:

h_att_list = []
h_list = []
hidden = hidden_init
for word in embedded_words:
    h, hidden = rnn(word, hidden)              # h: MB x nhid
    h_list.append(h)
    h_att = attention(torch.stack(h_list), h)  # attend over all steps so far; here attention is assumed to also apply the weighted sum, so h_att: MB x nhid
    h_att_list.append(h_att)

And then you can apply the decoder (which might need to be an MLP rather than just a linear transformation) to h_att_list.
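For illustration, one way such an MLP decoder could look; the hidden width and activation are my own choices, and ntoken is the vocabulary size as in the question's model:

import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(nhid, nhid),
    nn.Tanh(),
    nn.Linear(nhid, ntoken),
)
h_att = torch.stack(h_att_list)                          # seq_len x MB x nhid
logits = decoder(h_att.view(-1, nhid))                   # (seq_len*MB) x ntoken
logits = logits.view(h_att.size(0), h_att.size(1), -1)   # seq_len x MB x ntoken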

patapouf_ai
  • I have used your suggestion to feed word-by-word into the attention layer, but the performance of the model is rather poor. I have updated my question with the new results – Boris Mocialov Mar 09 '18 at 12:48
  • though, in my approach, I pass only one rnn output to the attention layer, not the list, like you show.. I would have dimensionality problem doing bmm in my attention if I would pass a list – Boris Mocialov Mar 09 '18 at 13:34
  • The whole point of attention is that it needs to choose which words it needs to pay attention to now. It then outputs weights for each word to say how important they are for the next step. If you have attention on a single word, then it is totally pointless. – patapouf_ai Mar 09 '18 at 13:47
  • I recommend you try to implement this architecture: https://arxiv.org/abs/1706.03762 where your output sequence is your input sequence (so that you are basically doing next step prediction). The above answer is the simplest modification to your current code which would make sense I think, but this attention paper should give better results. – patapouf_ai Mar 09 '18 at 13:50