Let's say I'm trying to train an RNN language model in PyTorch. Suppose I iterate over batches of word sequences, and that each training batch tensor has the following shape:
data.shape = [batch_size, sequence_length, vocab_dim]
My question is about the difference between two ways of setting up the targets. The first option uses only the last word in each sequence as the target label:
X = data[:,:-1]
y = data[:,-1]
and trains the model to minimize the softmax cross-entropy loss of predicting that single last word.
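Concretely, here is a minimal sketch of what I mean by the first option. The TinyRNNLM model, its sizes, and the random one-hot data are just stand-ins I made up so the snippet runs; I take argmax over the one-hot target because F.cross_entropy wants class indices:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRNNLM(nn.Module):
    # stand-in model: takes [batch, seq_len, vocab_dim] inputs (one-hot here)
    # and returns per-step logits of shape [batch, seq_len, vocab_dim]
    def __init__(self, vocab_dim, hidden_dim=128):
        super().__init__()
        self.rnn = nn.RNN(vocab_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_dim)

    def forward(self, x):
        out, _ = self.rnn(x)      # out: [batch, seq_len, hidden_dim]
        return self.head(out)     # logits: [batch, seq_len, vocab_dim]

vocab_dim = 1000
model = TinyRNNLM(vocab_dim)

# fake batch with the shape from my question: [batch_size, sequence_length, vocab_dim]
data = F.one_hot(torch.randint(0, vocab_dim, (32, 20)), vocab_dim).float()

# Option 1: only the last word of each sequence is the target
X = data[:, :-1]                   # [batch, seq_len - 1, vocab_dim]
y = data[:, -1].argmax(dim=-1)     # class index of the last word, [batch]

logits = model(X)[:, -1]           # keep only the final time step, [batch, vocab_dim]
loss = F.cross_entropy(logits, y)  # softmax loss on one predicted word per sequence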
The second option sets the target to be the entire sequence shifted by one position, so the target at every time step is the next word:
X = data[:,:-1]
y = data[:,1:]
and trains to minimize the sum of the losses of all the predicted words in the shifted sequence.
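And here is what I mean by the second option, reusing the same stand-in model, data, and vocab_dim from the sketch above; the only change is that the loss is taken over every time step instead of just the last one:

# Option 2: every next word is a target
X = data[:, :-1]                   # [batch, seq_len - 1, vocab_dim]
y = data[:, 1:].argmax(dim=-1)     # next-word indices, [batch, seq_len - 1]

logits = model(X)                  # per-step logits, [batch, seq_len - 1, vocab_dim]
loss = F.cross_entropy(
    logits.reshape(-1, vocab_dim), # flatten batch and time: [batch * (seq_len - 1), vocab_dim]
    y.reshape(-1),                 # [batch * (seq_len - 1)]
    reduction="sum",               # sum of per-word losses ("mean" is the usual default)
)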
What's the correct approach here? I feel like I've seen both versions in examples online. Does this also have to do with loop unrolling vs. BPTT?