Repetitive word predictions in RNN

Question

Hello dear community,

I am training a Seq2Seq model to generate a question based on a graph. Both train and val loss are converging, but the generated questions (on either the train or the test set) are nonsense and consist mostly of repeated tokens. I tried various hyperparameters and double-checked the input and output tensors.

One thing I do find odd is that the output out (see below) starts containing values that I consider unusually high. This starts happening around halfway through the first epoch:

Out:  tensor([[  0.2016, 103.7198,  90.4739,  ...,   0.9419,   0.4810,  -0.2869]])

My guess is vanishing/exploding gradients, which I thought I had handled with gradient clipping, but now I am not sure about this:

for p in model_params:
        p.register_hook(lambda grad: torch.clamp(
            grad, -clip_value, clip_value))
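For comparison, here is a minimal sketch of clipping after the backward pass with `torch.nn.utils.clip_grad_norm_`, which also returns the gradient norm measured before clipping, so it can be logged to see whether gradients are in fact exploding. The model, data, and `max_norm` threshold are hypothetical stand-ins, not the setup from the question. Note that clipping only bounds the size of each gradient; weights (and therefore logits) can still grow over many update steps.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the real encoder/decoder; only the clipping call matters.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.065)
max_norm = 1.0  # assumed threshold, chosen for illustration

inputs = torch.randn(32, 16) * 50.0  # deliberately large to produce large gradients
targets = torch.randn(32, 4)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most max_norm.
# The return value is the norm measured before clipping, which is useful for logging.
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)

optimizer.step()
print(f"grad norm before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

If the logged norm rarely exceeds the threshold, exploding gradients are probably not the cause of the large output values.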

Below are the training curves (10K samples, batch size=128, lr=0.065, lr_decay=0.99, dropout=0.25)

Encoder (a GNN that learns node embeddings of the input graph, which consists of around 3-4 nodes and edges. A single graph embedding is obtained by pooling the node embeddings and is fed to the Decoder as its initial hidden state):

class QuestionGraphGNN(torch.nn.Module):
    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 dropout,
                 aggr='mean'):
        super(QuestionGraphGNN, self).__init__()
        nn1 = torch.nn.Sequential(
            torch.nn.Linear(in_channels, hidden_channels),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_channels, in_channels * hidden_channels))
        self.conv = NNConv(in_channels, hidden_channels, nn1, aggr=aggr)
        self.lin = nn.Linear(hidden_channels, out_channels)
        self.dropout = dropout

    def forward(self, x, edge_index, edge_attr):
        x = self.conv(x, edge_index, edge_attr)
        x = F.leaky_relu(x)
        x = F.dropout(x, p=self.dropout)
        x = self.lin(x)
        return x

Decoder (The out vector from above is printed in the forward() function):

class DecoderRNN(nn.Module):
    def __init__(self,
                 embedding_size,
                 output_size,
                 dropout):
        super(DecoderRNN, self).__init__()
        self.output_size = output_size
        self.dropout = dropout

        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru1 = nn.GRU(embedding_size, embedding_size)
        self.gru2 = nn.GRU(embedding_size, embedding_size)
        self.gru3 = nn.GRU(embedding_size, embedding_size)
        self.out = nn.Linear(embedding_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inp, hidden):
        output = self.embedding(inp).view(1, 1, -1)
        output = F.leaky_relu(output)

        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru1(output, hidden)

        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru2(output, hidden)
        output, hidden = self.gru3(output, hidden)

        out = self.out(output[0])
        print("Out: ", out)
        output = self.logsoftmax(out)
        return output, hidden

I am using PyTorch's NLLLoss(). The optimizer is SGD. I call optimizer.zero_grad() right before the backward pass and the optimizer step, and I switch between training and evaluation mode for training, evaluation and testing.
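One detail worth checking when switching modes: the functional `F.dropout` has `training=True` as its default and does not look at the module's mode, so calls of the form `F.dropout(x, p=self.dropout)` keep dropping values even after `.eval()`. The sketch below uses a hypothetical `Block` module (not part of the code above) to show the difference when `training=self.training` is passed.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)


class Block(nn.Module):
    """Hypothetical module used only to compare the two dropout call styles."""

    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x, respect_mode):
        if respect_mode:
            # Dropout is applied only while the module is in training mode.
            return F.dropout(x, p=self.p, training=self.training)
        # Default training=True: dropout is applied regardless of .eval().
        return F.dropout(x, p=self.p)


block = Block(p=0.25).eval()
x = torch.ones(1000)

always_on = block(x, respect_mode=False)
mode_aware = block(x, respect_mode=True)

print("zeros with default call in eval mode:", int((always_on == 0).sum()))
print("zeros with training=self.training:   ", int((mode_aware == 0).sum()))
```

Using `nn.Dropout` as a submodule is an alternative that follows `.train()` and `.eval()` automatically.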

What are your thoughts on this?

Thank you very much!

EDIT

Dimensions of the Encoder:

in_channels=301 (This is the size of the initial node embeddings)

hidden_channels=256

out_channels=301 (This will also be the size of the final graph embedding, after mean pooling the node embeddings)

Dimensions of the Decoder:

embedding_size=301 (the size of the previously pooled graph embedding)

output_size=number of words in my vocabulary. In the training above around 1.2K

I am using top-k sampling and my train loop follows the NMT Tutorial (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training-the-model). Similarly, my translation function, which takes the data of a single graph, decodes a question as follows:

def translate(self, data):
    # Get node embeddings of the input graph
    h = self.encoder(data.node_embeddings,
                     data.edge_index, data.edge_embeddings)

    # Pool node embeddings into single graph embedding
    graph_embedding = self.get_graph_embeddings(h, data.graph_dict)

    # Pass graph embedding through decoder
    self.encoder.eval()
    self.decoder.eval()
    with torch.no_grad():
        # Initialize first input and hidden state
        decoder_input = torch.tensor(
            [[self.vocab.SOS['idx']]], device=self.device)
        decoder_hidden = graph_embedding.view(1, 1, -1)

        decoder_tokens = []
        for di in range(self.dec_max_length):
            decoder_output, decoder_hidden = self.decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == self.vocab.EOS['idx']:
                break
            else:
                word = self.vocab.index2word[topi.item()]
                word = word.upper(
                ) if word == self.vocab.UNK['token'].lower() else word
                decoder_tokens.append(word)
            decoder_input = topi.squeeze().detach()

        return decoder_tokens

Also: At times, the output vector of the final GRU layer (self.gru3(...)) inside the forward() function (5th line from the bottom) contains a lot of values that are (close to) 1 and -1. I suppose these might otherwise be a lot higher/lower without clipping. This might be alright, but it seems unusual to me. An example:

tensor([[[-0.9984, -0.9950,  1.0000, -0.9889, -1.0000, -0.9770, -0.0299,
          -0.9996,  0.9996,  1.0000, -0.0176, -0.5815, -0.9998, -0.0265,
          -0.1471,  0.9998, -1.0000, -0.2356,  0.9964,  0.9936, -0.9998,
           0.0652, -0.9999,  0.9999, -1.0000, -0.9998, -0.9999,  0.9998,
          -1.0000, -0.9997,  0.9850,  0.9994, -0.9998, -1.0000, -1.0000,
           0.9977,  0.9015, -0.9982,  1.0000,  0.9980, -1.0000,  0.9859,
           0.6670,  0.9998,  0.3827,  0.9999,  0.9953, -0.9989,  0.1287,
           1.0000,  1.0000, -1.0000,  0.9778,  1.0000,  1.0000, -0.9907, ...

score 0 · Accepted Answer · answered Oct 04 '22 at 16:10

Your code looks good, and given the training/validation curves you posted, it looks like it's doing alright.

How are you generating text samples? Are you just taking the word the model predicts with the highest probability, appending it to the end of your input sequence, and calling forward again? This technique, called greedy sampling, can lead to the behavior you described. Maybe another sampling technique could help (see beam search https://medium.com/geekculture/beam-search-decoding-for-text-generation-in-python-9184699f0120)?
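As a rough illustration of one such alternative, here is a minimal sketch of top-k sampling: keep the k most likely tokens, renormalize their probabilities, and draw one at random. The helper name `sample_top_k`, the `temperature` parameter, and the random stand-in for the decoder output are assumptions made for this example; it expects log-probabilities of shape `(1, vocab_size)`, like the output of a `LogSoftmax` layer.

```python
import torch


def sample_top_k(log_probs, k=10, temperature=1.0):
    """Draw one token index from the k most likely entries.

    log_probs: tensor of shape (1, vocab_size) holding log-probabilities.
    Returns a tensor of shape (1, 1) with the sampled token index.
    """
    top_values, top_indices = log_probs.topk(k, dim=-1)
    # Softmax over the kept log-probabilities renormalizes them to sum to 1.
    probs = torch.softmax(top_values / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # position within the top k
    return top_indices.gather(-1, choice)             # map back to a vocabulary index


torch.manual_seed(0)
# Stand-in for one decoder step over a vocabulary of about 1.2K words.
log_probs = torch.log_softmax(torch.randn(1, 1200), dim=-1)

sampled = sample_top_k(log_probs, k=10)
greedy = log_probs.topk(1).indices  # k=1 always picks the single most likely token
print("sampled:", sampled.item(), "greedy:", greedy.item())
```

With `k=1` this reduces to greedy decoding; larger `k` lets the decoder leave a loop in which the same token keeps being the most likely one.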

answered Oct 04 '22 at 16:10

DWKOT


Hello DWKOT, I saw your comment, thank you! I updated my post to clarify. – DustyAvocado Oct 04 '22 at 16:23
Cool, thanks for the update. Are you doing top-k sampling with k=1? That's functionally the same as greedy sampling (see the line in the translate function that says `topv, topi = decoder_output.data.topk(1)`). Sorry to be so pedantic about this, but greedy sampling does cause nonsensical output. You might want to generate more possible tokens (such as by calling `topk(10)`), and then do a weighted sample or something according to their probabilities. – DWKOT Oct 04 '22 at 16:29
I am/was not really familiar with this sampling method, so I am calling `.topk(1)`, thinking this will return the "most probable item" in the vector. I looked at the doc for the function just now and think my way of calling the function does not make too much sense, does it? I will look into the sampling approach more and report back when I have new results. – DustyAvocado Oct 04 '22 at 16:34
you're right, `.topk(1)` will return the most probable next word. It makes sense to take the most probable word, but it can lead the model to get stuck in nonsense loops, as you observed. here's a good article on sampling techniques https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277 . Good luck! – DWKOT Oct 04 '22 at 16:36
Alright, thank you for the hint! I will look into that and come back with the results. – DustyAvocado Oct 04 '22 at 16:37
No problem! Let me know if that produces better output. – DWKOT Oct 04 '22 at 23:34
This implementation of Beam Search (https://github.com/jarobyte91/pytorch_beam_search/blob/master/src/pytorch_beam_search/seq2seq/search_algorithms.py) takes the log softmax of the model predictions (line 143). So my Decoder should only return Softmax, not Logsoftmax, right? Therefore, I'd also change my Loss-Function from NLLLoss to CrossEntropyLoss. – DustyAvocado Oct 05 '22 at 11:36
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248564/discussion-between-dwkot-and-dustyavocado). – DWKOT Oct 05 '22 at 13:18

Quick follow-up: Repetitive predictions are indeed gone now. Loss curves are very oscillating, but this is unrelated to this topic and probably due to low training data/time. Thanks for your help! – DustyAvocado Oct 07 '22 at 09:22
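On the LogSoftmax versus Softmax question raised in the comments: `nn.CrossEntropyLoss` expects raw, unnormalized logits and applies log-softmax internally, so it gives the same loss as `LogSoftmax` followed by `NLLLoss`; passing it `Softmax` output would normalize twice. The sketch below checks this on random stand-in data (the shapes are assumptions). It also shows that applying log-softmax to values that are already log-probabilities leaves them unchanged, so a search routine that applies log-softmax itself can take either logits or log-probabilities.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in logits for 4 decoding steps over a vocabulary of about 1.2K words.
logits = torch.randn(4, 1200)
targets = torch.randint(0, 1200, (4,))

# Variant A: the decoder returns log-probabilities, the loss is NLLLoss.
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Variant B: the decoder returns raw logits, the loss is CrossEntropyLoss.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Log-softmax applied to log-probabilities returns the same log-probabilities.
once = torch.log_softmax(logits, dim=1)
twice = torch.log_softmax(once, dim=1)

print(f"NLLLoss on log-probs: {loss_nll.item():.6f}")
print(f"CrossEntropyLoss on logits: {loss_ce.item():.6f}")
```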
