
I'm trying to create an RNN that predicts the next word, given the previous word. But I'm struggling to model this as a dataset: specifically, how do I represent the next word to be predicted as a 'label'?

I could use a one-hot encoded vector for each word in the vocabulary, but (a) it would have tens of thousands of dimensions, given the large vocabulary, and (b) I'd lose all the other info contained in the word vector. Perhaps that info would be useful in calculating the error, i.e. how far off the predictions were from the actual word.

What should I do? Should I just use the one-hot encoded vector?
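As a sketch of one common answer to this (using plain numpy rather than any particular framework, with a made-up toy vocabulary and logits for illustration): the label can simply be the integer index of the true next word, and cross-entropy against the network's softmax output then only ever touches one entry of the distribution, so a full one-hot vector never needs to be materialized.

```python
import numpy as np

# Toy vocabulary: in practice this would be tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# Suppose the network's final layer produced these raw scores (logits)
# for the word following "the". The values are made up for illustration.
logits = np.array([0.1, 2.0, 0.3, -1.2, 0.5])

# Softmax turns the logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The label is just the index of the true next word: no one-hot vector needed.
label = word_to_index["cat"]

# Cross-entropy against an integer label picks out a single probability.
# This is identical to the dot product with a one-hot vector, without
# ever building the (vocab-sized) one-hot vector.
loss = -np.log(probs[label])

# The predicted next word is recovered with argmax over the distribution.
predicted = vocab[int(np.argmax(probs))]
```

The equivalence in the comment is the key point: an integer index and a one-hot vector carry exactly the same information to the loss function, so nothing is lost by using the index.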

Ali
  • What have you tried? I guess the output of your network would be a probability distribution over the vocabulary (so you do `argmax` to get the most probable word). The label could be a one-hot vector, or the index of the word in the vocabulary. Why are you afraid to "lose all the other info contained in the word vector"? – ygorg Feb 25 '21 at 15:58
  • @ygorg So far I've just been looking over docs/examples. My vocab is huge. Would you still say an index / one-hot vector is the best option? Re losing info: I thought the error is calculated by comparing the prediction vs. the true value. If the true value is a one-hot vector, perhaps info will be lost vs. if the true value were the actual word vector. – Ali Feb 25 '21 at 16:06
  • 1
    Well i'm not familiar with dl4j, but word embeddings is a matrix (voc_size, embed_size), the input of that is generally an index (so an int). Yes, error is computed by comparing prediction vs label. The prediction is generally an index of the vocabulary, and the label also an index of the vocabulary. What is "the actual word vector.". In my mind the prediction or the true value cannot be a word embedding vector, it can only be an index of the vocabulary. – ygorg Feb 25 '21 at 16:12
  • @ygorg I'm using the fastText word vectors, where each vector has 300 dimensions. Hmm, if the prediction / label is just an index, is there enough info there to learn from? E.g. a single number doesn't contain much info about how far off the prediction was? – Ali Feb 25 '21 at 16:18
  • 1
    Well sorry, yes the prediction is a probability distribution, so you can compare how the probability is different from 1. I think you should follow some tutorial on how to train RNN and mostly on the loss function that are used (negative log likelihood loss) – ygorg Feb 25 '21 at 16:25

0 Answers