
I'm trying to implement a neural network that generates sentences (image captions), and I'm using PyTorch's LSTM (nn.LSTM) for that.

The input I want to feed during training is of size batch_size * seq_size * embedding_size, where seq_size is the maximal length of a sentence. For example: 64 * 30 * 512.

After the LSTM there is one FC layer (nn.Linear). As far as I understand, this type of network works with a hidden state (h, c in this case) and predicts the next word at each step.
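
For reference, the surrounding module is shaped roughly like this (a minimal sketch, not my exact code: the class name and sizes are placeholders, but the layer names match what the forward below uses):

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    class CaptionDecoder(nn.Module):  # hypothetical name, for illustration
        def __init__(self, vocab_size, embed_size=512, hidden_size=512, num_layers=1):
            super().__init__()
            self.vocab_size = vocab_size
            self.hidden_size = hidden_size
            self.num_layers = num_layers
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def init_hidden(self, batch_size):
            # (num_layers, batch, hidden_size) for both h and c, regardless of batch_first
            h = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
            c = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
            return h, c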

My question is: during training, do we have to manually feed the sentence to the LSTM word by word in the forward function, or does the LSTM know how to do that by itself?

My forward function looks like this:

    def forward(self, features, caption, h=None, c=None):
        batch_size = caption.size(0)
        caption_size = caption.size(1)

        no_hc = False
        if h is None and c is None:
            no_hc = True
            h, c = self.init_hidden(batch_size)

        embeddings = self.embedding(caption)  # (batch, seq_len, embed_size)
        output = torch.empty((batch_size, caption_size, self.vocab_size)).to(device)

        for i in range(caption_size):  # go over the words in the sentence
            if i == 0:
                # first step: the image feature vector
                lstm_input = features.unsqueeze(1)
            else:
                # later steps: embedding of the previous ground-truth word
                lstm_input = embeddings[:, i-1, :].unsqueeze(1)

            out, (h, c) = self.lstm(lstm_input, (h, c))  # one timestep
            out = self.fc(out)                           # (batch, 1, vocab_size)

            output[:, i, :] = out.squeeze(1)

        if no_hc:
            return output

        return output, h, c

(took inspiration from here)

The output of the forward here is of size batch_size * seq_size * vocab_size, which is convenient because it can be compared against the original caption of size batch_size * seq_size in the loss function.
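
For example, with nn.CrossEntropyLoss the two tensors can be flattened and compared like this (a sketch; the criterion and variable names are just placeholders):

    criterion = nn.CrossEntropyLoss()

    # output: (batch_size, seq_size, vocab_size) logits from the forward above
    # caption: (batch_size, seq_size) ground-truth word indices
    loss = criterion(output.reshape(-1, output.size(-1)), caption.reshape(-1))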

The question is whether this for loop inside the forward, which feeds the words one after the other, is really necessary, or whether I can somehow feed the entire sentence at once and get the same result.

(I saw some examples that do that, for example this one, but I'm not sure whether it's really equivalent.)

Shir

1 Answer


The answer is: the LSTM knows how to do it on its own. You do not have to manually feed each word one by one. An intuitive way to understand this is that the batch you send already contains a seq_length dimension (batch.shape[1]), from which the LSTM infers the number of words in the sentence. The words are then passed through the LSTM cell internally, one timestep at a time, producing the hidden states h and c.
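
Concretely, the loop in the question's forward can be collapsed into a single LSTM call by building the whole input sequence first. A sketch, assuming `self.lstm` was created with `batch_first=True` and keeping the same teacher-forcing scheme (image feature at step 0, then the previous ground-truth word):

    def forward(self, features, caption, h=None, c=None):
        batch_size = caption.size(0)

        no_hc = False
        if h is None and c is None:
            no_hc = True
            h, c = self.init_hidden(batch_size)

        embeddings = self.embedding(caption)          # (batch, seq_len, embed_size)

        # Same inputs as the loop: image feature at t=0, then embeddings of
        # words 0 .. seq_len-2 at t=1 .. seq_len-1.
        lstm_input = torch.cat([features.unsqueeze(1), embeddings[:, :-1, :]], dim=1)

        out, (h, c) = self.lstm(lstm_input, (h, c))   # (batch, seq_len, hidden_size)
        output = self.fc(out)                         # (batch, seq_len, vocab_size)

        if no_hc:
            return output
        return output, h, c

This gives the same result as the per-word loop, because `nn.LSTM` unrolls over the `seq_len` dimension internally; an explicit loop is only needed at generation time, when each predicted word has to be fed back in as the next input.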

harshraj22
  • Thanks! Do you know what happens if I need to use attention in my model (e.g. nn.MultiheadAttention)? Will it still work without the loop, or do I need it? – Shir Jan 02 '22 at 20:56
  • For using attention, you have to define which vectors are to be used as keys, queries and values. If the words are to be considered as queries & values, [LSTM in PyTorch](https://pytorch.org/docs/1.9.1/generated/torch.nn.LSTM.html) returns `output` together with `(h_n, c_n)`. While `h_n` & `c_n` contain the information from the last word sent to the LSTM cell, `output` contains the hidden state outputs for all timesteps (look closely, its shape contains `seq_len` as one of the dimensions). You can use the tensors corresponding to each word along the `seq_len` dimension as queries and values. @Shir – harshraj22 Jan 03 '22 at 07:31
  • Thanks harshraj! If I use attention, do I need to feed the words one by one instead of all together? Because from what I understand I need to run a step of the LSTM and then a step of the attention at each timestep, so if I understand correctly, the loop in the forward is required there. Is that true? – Shir Jan 03 '22 at 08:08
  • @Shir I fail to understand what made you think that for using attention you need to send the words one by one, but the answer is no. Even when using `nn.MultiheadAttention` you do not need to loop over the words. If you look at the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) for the shapes of the input tensors, they expect a dimension `L` (or `S`) which represents the `seq_len`. The whole batch of words is to be passed at once, and it will internally do all the required work; see the sketch after these comments. – harshraj22 Jan 03 '22 at 08:16
  • In order to add `nn.MultiheadAttention` to my model, do I need to add it before the embedding and LSTM, and then send the inputs to the LSTM? I read that the attention should get input of dimensions `batch * seq_len * embed_size`, so I thought I should duplicate the features (image) `seq` times. And then I thought I should concat the output with the caption to feed the LSTM, but then the output of the LSTM is `batch * 2seq_len * embed`, and I wasn't sure how to decrease the size of the middle dimension. Thank you very much for the help! – Shir Jan 03 '22 at 09:05
  • I opened another question about the Attention - https://stackoverflow.com/questions/70569962/how-to-use-nn-multiheadattention-together-with-nn-lstm – Shir Jan 03 '22 at 19:13
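
A minimal sketch of what the comments describe, i.e. `nn.MultiheadAttention` applied to the whole LSTM output sequence in one call (the sizes are placeholder assumptions; both modules are created with `batch_first=True`):

    import torch
    import torch.nn as nn

    # Placeholder sizes for illustration.
    batch_size, seq_len, embed_size, hidden_size, num_heads = 64, 30, 512, 512, 8

    lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
    attn = nn.MultiheadAttention(embed_dim=hidden_size, num_heads=num_heads, batch_first=True)

    x = torch.randn(batch_size, seq_len, embed_size)   # a whole batch of embedded words
    lstm_out, (h_n, c_n) = lstm(x)                     # lstm_out: (batch, seq_len, hidden_size)

    # Self-attention over all timesteps at once, no Python loop over the words.
    attn_out, attn_weights = attn(lstm_out, lstm_out, lstm_out)
    print(attn_out.shape)                              # torch.Size([64, 30, 512])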