
I have developed an Encoder (CNN) - Decoder (RNN) network for image captioning in PyTorch. During training the decoder network takes two inputs: the context feature vector from the encoder and the word embeddings of the caption. The context feature vector has size embed_size, which is also the embedding size of each word in the caption. My question concerns the output of the class DecoderRNN. Please refer to the code below.

import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

    def forward(self, features, captions):
        # (batch_size, caption_length) -> (batch_size, caption_length, embed_size)
        embeddings = self.embed(captions)
        # prepend the image feature as the first time step
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        # hiddens: (batch_size, caption_length + 1, hidden_size)
        hiddens, _ = self.lstm(embeddings)
        # outputs: (batch_size, caption_length + 1, vocab_size)
        outputs = self.linear(hiddens)
        return outputs

In the forward function, I pass in a sequence of shape (batch_size, caption_length + 1, embed_size), i.e. the context feature vector concatenated with the embedded caption. The output should be scores over the caption words with shape (batch_size, caption_length, vocab_size), but I keep receiving an output of shape (batch_size, caption_length + 1, vocab_size).
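For illustration, a quick shape check reproduces what I see (the sizes here are made up purely for the example):

import torch

# made-up sizes, only for the example
batch_size, caption_length = 4, 12
embed_size, hidden_size, vocab_size = 256, 512, 1000

decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)                       # encoder output
captions = torch.randint(0, vocab_size, (batch_size, caption_length))

outputs = decoder(features, captions)
print(outputs.shape)  # torch.Size([4, 13, 1000]) -> caption_length + 1 time steps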

What should I alter in my forward function so that the extra element in the second dimension is not produced?

Vineet Pandey

1 Answer


Since an LSTM (or any RNN) produces one output per time step (per caption position here), I do not see any problem. If you want an output of length caption_length, you need to make the input caption_length long along the second dimension. Alternatively, people usually append an <end of sentence> token to the target, so the target length becomes caption_length + 1 and matches the output (see the sketch below).
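A rough sketch of the second option, with captions and outputs as in your forward pass; end_idx and the loss setup are assumptions just to illustrate the alignment:

import torch
import torch.nn as nn

end_idx = 1  # assumed index of the <end of sentence> token in the vocabulary

# captions: (batch_size, caption_length); append <end> so targets have caption_length + 1 steps
end_col = torch.full((captions.size(0), 1), end_idx, dtype=captions.dtype)
targets = torch.cat((captions, end_col), dim=1)       # (batch_size, caption_length + 1)

# outputs: (batch_size, caption_length + 1, vocab_size) from the decoder
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs.reshape(-1, outputs.size(-1)), targets.reshape(-1))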

Umang Gupta
    I found a solution to the issue. The output should have shape (batch_size, caption_length, vocab_size), but I was getting the extra time step. Apparently the first time step of the output can be ignored, as it doesn't contribute to the caption generation but merely initiates the process. Hence an extra statement, `hiddens = hiddens[:,1:,:]`, can be used. – Vineet Pandey Aug 02 '18 at 23:03
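For completeness, here is a sketch of how that slicing from the comment would slot into the forward method of the DecoderRNN above (one possible way to handle it, not necessarily the only one):

    def forward(self, features, captions):
        embeddings = self.embed(captions)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        hiddens, _ = self.lstm(embeddings)
        hiddens = hiddens[:, 1:, :]   # drop the first time step (driven by the image feature)
        outputs = self.linear(hiddens)
        return outputs                # (batch_size, caption_length, vocab_size)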