
I have the following decoder for machine translation which, after a few training steps, only predicts the EOS token. Because of this it is impossible to overfit even on a tiny dummy dataset, so it seems that there is a serious error in the code.

Decoder(
  (embedding): Embeddings(
    (word_embeddings): Embedding(30002, 768, padding_idx=3)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (ffn1): FFN(
    (dense): Linear(in_features=768, out_features=512, bias=False)
    (layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (activation): GELU()
  )
  (rnn): GRU(512, 512, batch_first=True, bidirectional=True)
  (ffn2): FFN(
    (dense): Linear(in_features=1024, out_features=512, bias=False)
    (layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (activation): GELU()
  )
  (selector): Sequential(
    (0): Linear(in_features=512, out_features=30002, bias=True)
    (1): LogSoftmax(dim=-1)
  )
)

The forward is relatively straightforward (see what I did there?): pass the input_ids through the embedding and an FFN, then use that representation in the RNN with the given sembedding as the initial hidden state. Pass the output through another FFN and a (log) softmax. Return the logits and the last hidden state of the RNN. In the next step, use that hidden state as the new hidden state, and the highest-predicted token as the new input.

def forward(self, input_ids, sembedding):
    # input_ids: (batch_size, seq_len); sembedding: initial hidden state for the GRU
    embedded = self.embedding(input_ids)
    output = self.ffn1(embedded)                   # project 768 -> 512 to match the GRU
    output, hidden = self.rnn(output, sembedding)
    output = self.ffn2(output)                     # project num_directions * 512 -> 512
    logits = self.selector(output)                 # linear + log-softmax over the vocabulary

    return logits, hidden

sembedding is the initial hidden state for the RNN. This is similar to an encoder-decoder architecture, only here we do not train the encoder; instead we have access to pretrained encoder representations.
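
To make the shapes concrete, here is a minimal sketch of how sembedding ends up as the GRU's initial hidden state (hidden size 512 and a bidirectional GRU are assumed, matching the printout above; the BOS id is just a placeholder):

import torch

batch_size, hidden_size = 4, 512

# sembedding: one pretrained sentence representation per example
sembedding = torch.randn(batch_size, hidden_size)

# A (bidirectional) GRU expects its hidden state as
# (num_layers * num_directions, batch_size, hidden_size),
# so the second direction is initialised with zeros.
decoder_hidden = torch.stack((sembedding, torch.zeros_like(sembedding)))
print(decoder_hidden.shape)  # torch.Size([2, 4, 512])

# First decoder input: one BOS token per example -> (batch_size, 1)
decoder_input = torch.full((batch_size, 1), 0, dtype=torch.long)  # 0 = placeholder BOS id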

In my training loop I start each batch off with an SOS (BOS) token and feed every top-predicted token to the next step until target_len is reached. I also randomly switch between teacher-forced and non-teacher-forced training.

def step(self, batch, teacher_forcing_ratio=0.5):
    batch_size, target_len = batch["input_ids"].size()[:2]
    # Init first decoder input with an SOS (BOS) token
    decoder_input = torch.tensor([[self.tokenizer.bos_token_id]] * batch_size).to(self.device)
    batch["input_ids"] = batch["input_ids"].to(self.device)

    # Init first decoder hidden state: add a zeroed state for the second direction if the RNN is bidirectional
    if self.model.num_directions == 2:
        decoder_hidden = torch.stack((batch["sembedding"],
                                      torch.zeros(*batch["sembedding"].size()))).to(self.device)
    else:
        decoder_hidden = batch["sembedding"].unsqueeze(0).to(self.device)

    loss = torch.tensor([0.]).to(self.device)

    use_teacher_forcing = random.random() < teacher_forcing_ratio
    # contains tuples of predicted and correct words
    tokens = []
    for i in range(target_len):
        # overwrite previous decoder_hidden
        output, decoder_hidden = self.model(decoder_input, decoder_hidden)
        batch_correct_ids = batch["input_ids"][:, i]

        # NLLLoss computes the loss between the predicted classes (bs x classes)
        # and the correct class for _this word_; it is set to ignore the padding index
        loss += self.criterion(output[:, 0, :], batch_correct_ids)

        batch_predicted_ids = output.topk(1).indices.squeeze(1).detach()

        # if teacher forcing: use the current correct word for the next prediction
        # else (no teacher forcing): use the current prediction for the next prediction
        decoder_input = batch_correct_ids.unsqueeze(1) if use_teacher_forcing else batch_predicted_ids

    return loss, loss.item() / target_len

I also clip the gradients after each step:

clip_grad_norm_(self.model.parameters(), 1.0)
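
To show where that call sits, here is a rough, hypothetical sketch of my outer loop; train_epoch, optimizer and dataloader are placeholders and not the exact names in my code:

from torch.nn.utils import clip_grad_norm_

def train_epoch(self, dataloader, optimizer):
    # Simplified outer loop: step() accumulates the loss over all decoding steps,
    # so backward() is called once per batch, followed by clipping and the update.
    for batch in dataloader:
        optimizer.zero_grad()
        loss, avg_loss = self.step(batch, teacher_forcing_ratio=0.5)
        loss.backward()
        clip_grad_norm_(self.model.parameters(), 1.0)  # clip before the optimizer update
        optimizer.step()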

At first the subsequent predictions are already nearly identical, and after a few iterations there is a bit more variation. But relatively quickly ALL predictions turn into other words (always the same ones), and eventually into EOS tokens (edit: after changing the activation to ReLU, a different token is always predicted; it seems like a random token that simply gets repeated). Note that this already happens after 80 steps (batch_size 128).

I found that the returned hidden state of the RNN contains a lot of zeros. I am not sure if that is the problem but it seems like it could be related.

tensor([[[  3.9874e-02,  -6.7757e-06,   2.6094e-04,  ...,  -1.2708e-17,
            4.1839e-02,   7.8125e-03],
         [ -7.8125e-03,  -2.5341e-02,   7.8125e-03,  ...,  -7.8125e-03,
           -7.8125e-03,  -7.8125e-03],
         [ -0.0000e+00, -1.0610e-314,   0.0000e+00,  ...,   0.0000e+00,
            0.0000e+00,   0.0000e+00],
         [  0.0000e+00,   0.0000e+00,   0.0000e+00,  ...,   0.0000e+00,
           -0.0000e+00,  1.0610e-314]]], device='cuda:0', dtype=torch.float64,
       grad_fn=<CudnnRnnBackward>)
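
To narrow this down, a small diagnostic can be dropped into step right before the model call; this is a hypothetical snippet (the names mirror the ones in step above) that just prints dtypes, shapes and the fraction of zeros:

# hypothetical debugging lines, placed just before self.model(decoder_input, decoder_hidden)
print("decoder_input:", decoder_input.dtype, tuple(decoder_input.shape))
print("embedding weight:", self.model.embedding.word_embeddings.weight.dtype)
print("decoder_hidden:", decoder_hidden.dtype, tuple(decoder_hidden.shape),
      "fraction of zeros:", (decoder_hidden == 0).float().mean().item())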

I have no idea what might be going wrong, although I suspect that the issue lies with my step rather than with the model. I have already tried playing with the learning rate, disabling some layers (LayerNorm, dropout, ffn2), using pretrained embeddings and freezing or unfreezing them, disabling teacher forcing, and using a bidirectional vs. unidirectional GRU. The end result is always the same.

If you have any pointers, that would be very helpful. I have googled many things concerning neural networks always predicting the same item and I have tried all the suggestions that I could find. Any new ones, no matter how crazy, are welcome!

Bram Vanroy
  • Having `bidirectional=True` probably does not make that much sense for a decoder RNN. – Palle Sep 13 '20 at 11:42
  • True, although disabling it did not solve the issue. – Bram Vanroy Sep 13 '20 at 11:47
  • I believe that `use_teacher_forcing` should be randomly generated for each decoding step, not for the whole sequence (see https://arxiv.org/pdf/1506.03099.pdf Fig. 1). But that shouldn't cause the issue, I've made that mistake myself in the past and it still worked. Have you tried turning off gradient clipping? – Palle Sep 13 '20 at 11:52
  • @Palle Thanks for the info, but disabling use_teacher_forcing does not change anything anyway. Disabling gradient clipping also does not make a difference. – Bram Vanroy Sep 13 '20 at 11:56
  • What kind of data are you using? Your model may be a little too complex for it. If that's not the issue, then you may want to check if the gradients are flowing when you train the model. You can use the code [here](https://discuss.pytorch.org/t/check-gradient-flow-in-network/15063/7?u=seankala) to try it out. – Sean Sep 13 '20 at 11:59
  • @Seankala Thanks for that. I just tried it and the graphs are relatively empty with a small gradient for the selector.weight. I am not sure what that means though, or how I can solve that. – Bram Vanroy Sep 13 '20 at 12:05
  • @Seankala I am working with text data – Bram Vanroy Sep 13 '20 at 12:06

1 Answer


In my case the issue appeared to be that the dtype of the initial hidden state was a double (float64) while the input was a float (float32). I don't quite understand why that is an issue, but casting the hidden state to a float solved it. If you have any intuition about why this might be a problem for PyTorch, do let me know in the comments or, better yet, on the official PyTorch forums.
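
Concretely, the fix boils down to casting sembedding (or the stacked hidden state) to float32 before it goes into the GRU. A minimal sketch, assuming the sentence embedding arrives as float64 (e.g. loaded from a numpy array):

import torch

# a sentence embedding arriving as float64 triggers the issue in PyTorch 1.6
sembedding = torch.randn(4, 512, dtype=torch.float64)

# cast to float32 so it matches the model's parameters
sembedding = sembedding.float()
decoder_hidden = torch.stack((sembedding, torch.zeros_like(sembedding)))
print(decoder_hidden.dtype)  # torch.float32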

EDIT: as that topic shows, this is a bug in PyTorch 1.6 that is solved in 1.7. In 1.7 you will get an error message instead, which will hopefully save you the trouble of debugging all your code without ever finding the cause of the strange behaviour.

Bram Vanroy
  • +1) As someone who also frequently struggles with models predicting the same value, this is something that I absolutely had zero idea might be a problem (as you also implied). I managed to find [this discussion thread](https://discuss.pytorch.org/t/floattensor-and-doubletensor/28553) on the PyTorch Discussion Forum but I'm also not sure if it directly addresses the issue. I'll have to experiment with this myself. – Sean Sep 14 '20 at 12:11
  • @Seankala I had no idea either indeed, and only came unto this by accident - but I am glad I did. I added a link to a topic that I just made on the PyTorch forums. You can keep an eye on that, perhaps someone has an idea about the how and why over there. – Bram Vanroy Sep 14 '20 at 12:51