Hello dear community,
I am training a Seq2Seq model to generate a question from a graph. Both train and validation loss are converging, but the generated questions (on both the train and test sets) are nonsense and consist mostly of repeated tokens. I tried various hyperparameters and double-checked the input and output tensors.
One thing I do find odd is that the output vector out (see below) starts to contain values that I consider unusually high. This starts happening about halfway through the first epoch:
Out: tensor([[ 0.2016, 103.7198, 90.4739, ..., 0.9419, 0.4810, -0.2869]]
My guess is that this is caused by vanishing/exploding gradients, which I thought I had handled with gradient clipping, but now I am not sure:
for p in model_params:
    p.register_hook(lambda grad: torch.clamp(
        grad, -clip_value, clip_value))
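For comparison, a commonly used alternative to per-element clamping hooks is to clip the global gradient norm after backward() with torch.nn.utils.clip_grad_norm_. It returns the norm measured before clipping, so logging it shows whether gradients are in fact exploding. This is only a minimal sketch: the nn.Linear stand-in model, the random data and max_norm = 1.0 are assumptions, not taken from the setup above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model; the real encoder/decoder would go here.
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.065)
max_norm = 1.0  # assumed threshold, not from the original post

# Large random inputs to provoke large gradients.
x = torch.randn(8, 4) * 100.0
target = torch.randint(0, 3, (8,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most
# max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Recompute the combined norm to confirm the clipping took effect.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()

print("grad norm before clipping:", float(norm_before))
print("grad norm after clipping: ", float(norm_after))
```

Unlike element-wise clamping, norm clipping scales every gradient by the same factor, so the update direction is preserved.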
Below are the training curves (10K samples, batch size=128, lr=0.065, lr_decay=0.99, dropout=0.25)
Encoder (a GNN that learns node embeddings of the input graph, which consists of around 3-4 nodes and edges. A single graph embedding is obtained by pooling the node embeddings and is fed to the Decoder as its initial hidden state):
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import NNConv


class QuestionGraphGNN(torch.nn.Module):
    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 dropout,
                 aggr='mean'):
        super(QuestionGraphGNN, self).__init__()
        # Edge network: maps edge features to a per-edge weight matrix
        nn1 = torch.nn.Sequential(
            torch.nn.Linear(in_channels, hidden_channels),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_channels, in_channels * hidden_channels))
        self.conv = NNConv(in_channels, hidden_channels, nn1, aggr=aggr)
        self.lin = nn.Linear(hidden_channels, out_channels)
        self.dropout = dropout

    def forward(self, x, edge_index, edge_attr):
        x = self.conv(x, edge_index, edge_attr)
        x = F.leaky_relu(x)
        x = F.dropout(x, p=self.dropout)
        x = self.lin(x)
        return x
Decoder (the out vector from above is printed in the forward() function):
class DecoderRNN(nn.Module):
    def __init__(self,
                 embedding_size,
                 output_size,
                 dropout):
        super(DecoderRNN, self).__init__()
        self.output_size = output_size
        self.dropout = dropout
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru1 = nn.GRU(embedding_size, embedding_size)
        self.gru2 = nn.GRU(embedding_size, embedding_size)
        self.gru3 = nn.GRU(embedding_size, embedding_size)
        self.out = nn.Linear(embedding_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inp, hidden):
        output = self.embedding(inp).view(1, 1, -1)
        output = F.leaky_relu(output)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru1(output, hidden)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru2(output, hidden)
        output, hidden = self.gru3(output, hidden)
        out = self.out(output[0])
        print("Out: ", out)
        output = self.logsoftmax(out)
        return output, hidden
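To see what logits of this size do once they pass through the log-softmax, here is a small sketch in plain Python so the arithmetic is visible. It uses only the six entries visible in the printed out tensor; the elided middle part is unknown, so this illustrates the effect and does not reproduce the real distribution.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

# Only the entries visible in the printed tensor; the "..." part is unknown.
visible_logits = [0.2016, 103.7198, 90.4739, 0.9419, 0.4810, -0.2869]

log_probs = log_softmax(visible_logits)
probs = [math.exp(lp) for lp in log_probs]
best = max(range(len(probs)), key=probs.__getitem__)

print("argmax index:", best)
print("probability of argmax:", probs[best])
```

A gap of about 13 between the two largest logits already puts almost all probability mass on one token, so picking the top entry at every step would tend to return the same token whenever the logits look like this.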
I am using PyTorch's NLLLoss(). The optimizer is SGD. I call optimizer.zero_grad() right before the backward pass and the optimizer step, and I switch between training and evaluation mode for training, evaluation and testing.
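One detail worth checking in connection with the mode switching: F.dropout has a training argument that defaults to True, so model.eval() only disables it if training=self.training is passed explicitly (nn.Dropout modules follow the mode on their own). A minimal sketch with two hypothetical toy modules, AlwaysDrops and RespectsMode, which are not part of the model above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDrops(nn.Module):
    """Functional dropout without the training flag (defaults to True)."""
    def forward(self, x):
        return F.dropout(x, p=0.25)

class RespectsMode(nn.Module):
    """Functional dropout tied to the module's train/eval mode."""
    def forward(self, x):
        return F.dropout(x, p=0.25, training=self.training)

torch.manual_seed(0)
x = torch.ones(1000)

# Both modules are put into evaluation mode.
always = AlwaysDrops().eval()
respects = RespectsMode().eval()

zeros_always = int((always(x) == 0).sum())
zeros_respects = int((respects(x) == 0).sum())

print("zeroed entries in eval mode, no flag:  ", zeros_always)
print("zeroed entries in eval mode, with flag:", zeros_respects)
```

If the first count is non-zero, dropout is still active at evaluation and translation time.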
What are your thoughts on this?
Thank you very much!
EDIT
Dimensions of the Encoder:
in_channels=301 (this is the size of the initial node embeddings)
hidden_channels=256
out_channels=301 (this will also be the size of the final graph embedding, after mean pooling the node embeddings)
Dimensions of the Decoder:
embedding_size=301 (the size of the previously pooled graph embedding)
output_size=number of words in my vocabulary, around 1.2K in the training run above
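With these dimensions, the size of the encoder's linear layers can be worked out by hand, which may help judge model capacity against the 10K training samples. The sketch below counts only the weights and biases of the nn.Linear layers in nn1 and self.lin; NNConv's own parameters are left out, and linear_params is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

in_channels, hidden_channels, out_channels = 301, 256, 301

# Edge network nn1: Linear(in, hidden) -> ReLU -> Linear(hidden, in * hidden)
nn1_first = linear_params(in_channels, hidden_channels)
nn1_second = linear_params(hidden_channels, in_channels * hidden_channels)

# Final projection self.lin
lin = linear_params(hidden_channels, out_channels)

print("nn1 first layer: ", nn1_first)   # 77312
print("nn1 second layer:", nn1_second)  # 19803392
print("lin:             ", lin)         # 77357
```

Almost all of these parameters sit in the second layer of the edge network, because its output is reshaped into an in_channels x hidden_channels matrix per edge.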
I am using top-k sampling and my training loop follows the NMT tutorial (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training-the-model). Similarly, my translation function, which takes the data of a single graph, decodes a question as follows:
def translate(self, data):
    # Get node embeddings of the input graph
    h = self.encoder(data.node_embeddings,
                     data.edge_index, data.edge_embeddings)
    # Pool node embeddings into single graph embedding
    graph_embedding = self.get_graph_embeddings(h, data.graph_dict)
    # Pass graph embedding through decoder
    self.encoder.eval()
    self.decoder.eval()
    with torch.no_grad():
        # Initialize first input and hidden state
        decoder_input = torch.tensor(
            [[self.vocab.SOS['idx']]], device=self.device)
        decoder_hidden = graph_embedding.view(1, 1, -1)
        decoder_tokens = []
        for di in range(self.dec_max_length):
            decoder_output, decoder_hidden = self.decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == self.vocab.EOS['idx']:
                break
            else:
                word = self.vocab.index2word[topi.item()]
                if word == self.vocab.UNK['token'].lower():
                    word = word.upper()
                decoder_tokens.append(word)
            decoder_input = topi.squeeze().detach()
        return decoder_tokens
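Note that topk(1) as used above always takes the single most likely token, which amounts to greedy decoding. For comparison, here is a minimal sketch of sampling among the k most likely tokens. The function sample_top_k and its k and temperature arguments are hypothetical names, and the logits are made up for illustration.

```python
import torch

def sample_top_k(log_probs, k=5, temperature=1.0):
    """Sample one token id from the k most likely entries.

    log_probs: tensor of shape (1, vocab_size) holding log-probabilities.
    Returns a tensor of shape (1, 1) with the sampled token id.
    """
    topv, topi = log_probs.topk(k, dim=-1)
    # Renormalize over the k candidates only.
    probs = torch.softmax(topv / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    # Map the position within the top-k list back to a vocabulary id.
    return topi.gather(-1, choice)

torch.manual_seed(0)
logits = torch.tensor([[0.1, 2.0, 1.5, -1.0, 0.3, 1.9]])
log_probs = torch.log_softmax(logits, dim=1)
next_id = sample_top_k(log_probs, k=3)
print("sampled token id:", next_id.item())
```

In the loop above, the returned tensor could take the place of topi when building the next decoder_input.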
Also: at times, the output vector of the final GRU layer (self.gru3(...)) inside the forward() function (5th line from the bottom) contains a lot of values that are (close to) 1 and -1. I suppose these might otherwise be a lot higher/lower without clipping. This might be alright, but it seems unusual to me. An example:
tensor([[[-0.9984, -0.9950, 1.0000, -0.9889, -1.0000, -0.9770, -0.0299,
-0.9996, 0.9996, 1.0000, -0.0176, -0.5815, -0.9998, -0.0265,
-0.1471, 0.9998, -1.0000, -0.2356, 0.9964, 0.9936, -0.9998,
0.0652, -0.9999, 0.9999, -1.0000, -0.9998, -0.9999, 0.9998,
-1.0000, -0.9997, 0.9850, 0.9994, -0.9998, -1.0000, -1.0000,
0.9977, 0.9015, -0.9982, 1.0000, 0.9980, -1.0000, 0.9859,
0.6670, 0.9998, 0.3827, 0.9999, 0.9953, -0.9989, 0.1287,
1.0000, 1.0000, -1.0000, 0.9778, 1.0000, 1.0000, -0.9907, ...
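For context on values pinned near 1 and -1: a GRU's new hidden state is a gate-weighted average of the previous state and a tanh candidate, so when the initial hidden state lies in [-1, 1], every output stays in that range regardless of gradient clipping. A minimal sketch, assuming a small hidden size of 8 and a zero initial state (in the model above the initial state is the pooled graph embedding, which comes from a linear layer and is not bounded this way):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # small illustrative size; the post uses 301
gru = nn.GRU(hidden_size, hidden_size)

# Deliberately large inputs, with an initial hidden state inside [-1, 1].
inputs = torch.randn(5, 1, hidden_size) * 100.0
h0 = torch.zeros(1, 1, hidden_size)

with torch.no_grad():
    output, hn = gru(inputs, h0)

max_abs = output.abs().max().item()
print("largest |output| value:", max_abs)
```

Outputs close to 1 or -1 therefore point to saturated tanh units (large pre-activations) and are not values that clipping has cut off.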