I went through your code; the error trace points to the forward call of `SentenceEmbedding` (encoder stage):
```python
     69 def forward(self, x, start_token, end_token): # sentence
     70     x = self.batch_tokenize(x, start_token, end_token)
---> 71     x = self.embedding(x)
     72     pos = self.position_encoder().to(get_device())
     73     x = self.dropout(x + pos)
```
If you add `print(torch.max(x))` before the line `x = self.embedding(x)`, you can see that the error occurs because `x` contains an id that is >= 68. Whenever an id is greater than or equal to the embedding table's size (68 here), PyTorch raises the error shown in the stack trace. It means that while converting tokens to ids, you are assigning a value of 68 or more.
To prove my point: when you create `english_to_index`, there are three `""` entries in your `english_vocabulary` (`START_TOKEN`, `PADDING_TOKEN`, and `END_TOKEN` are all `""`). Duplicate keys collapse in the dictionary and the last occurrence's position wins, so you end up generating `{ "": 69 }`. That id is greater than or equal to `len(english_to_index)` (which is 68, because the duplicates collapsed into one key). Hence you are getting `IndexError: index out of range in self`.
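The collapse is easy to reproduce with a miniature vocabulary (hypothetical tokens for illustration, not your actual list):

```python
# Hypothetical six-token vocabulary: three entries are the empty string,
# mirroring START_TOKEN, PADDING_TOKEN and END_TOKEN all being "".
english_vocabulary = ["", "a", "b", "", "c", ""]

# The usual token -> id comprehension: duplicate keys collapse, and the
# LAST occurrence's position wins.
english_to_index = {token: i for i, token in enumerate(english_vocabulary)}

print(english_to_index)       # {'': 5, 'a': 1, 'b': 2, 'c': 4}
print(len(english_to_index))  # 4

# An embedding created with num_embeddings=len(english_to_index) == 4
# only accepts ids 0..3, so looking up "" (id 5) raises
# IndexError: index out of range in self.
```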
Solution

As a solution, give these tokens unique tags (as is generally prescribed):

```python
START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"
```

This makes sure that the generated dictionaries have the correct sizes.
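With unique sentinel tokens, the same comprehension yields one id per token, and every id stays below the vocabulary size (again a toy vocabulary, just to show the sizes line up):

```python
START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"

# Toy vocabulary with the unique sentinel tokens included.
english_vocabulary = [START_TOKEN, "a", "b", PADDING_TOKEN, "c", END_TOKEN]
english_to_index = {token: i for i, token in enumerate(english_vocabulary)}

print(len(english_to_index))           # 6 -- one entry per token
print(max(english_to_index.values()))  # 5 -- every id < len(english_to_index)

# An embedding sized with num_embeddings=len(english_to_index) can now
# look up any token without going out of range.
```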
Please find the working Google Colaboratory file here with the solution section. I also added `'\\'` to the `english_vocabulary`, since after a few iterations we otherwise get a `KeyError: '\\'`.
Hope it helps.