Hello, I'm implementing an LSTM for language modelling as homework, and I'm at the loss-implementation phase. Our instructor told us to use F.nll_loss, but the sequences are padded, and we are given a mask telling us where each sequence stops, which the loss has to take into account.
Inputs:
- log_probas (batch_size, sequence_length (padded), vocabulary_size)
- targets (batch_size, sequence_length (padded))
- mask (batch_size, sequence_length (padded))
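For concreteness, here is a minimal sketch of dummy tensors with those shapes (all the sizes, and the fixed length of 6, are made up purely for illustration):

import torch

batch_size, seq_len, vocab_size = 4, 10, 100  # made-up sizes
log_probas = torch.randn(batch_size, seq_len, vocab_size).log_softmax(dim=-1)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))
# my understanding: mask is 1.0 for real tokens, 0.0 for padding
mask = (torch.arange(seq_len) < 6).float().expand(batch_size, seq_len)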
Naive implementation, which works but ignores the mask:

import torch.nn.functional as F

# F.nll_loss expects the class dimension second, i.e. (batch, vocab, seq),
# so swap the last two axes of log_probas
loss = F.nll_loss(log_probas.transpose(1, 2), targets)
I've been crawling the internet and banging my head against the wall, but I can't seem to find an answer on how to incorporate the mask into the averaging scheme of the loss.
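My best guess so far is the sketch below: ask F.nll_loss for per-token losses with reduction='none', zero out the padded positions with the mask, and divide by the number of real tokens rather than by all positions (this assumes the mask is 1 for real tokens and 0 for padding, as in the sketch above). Does this averaging look right?

# per-token losses, shape (batch_size, sequence_length)
per_token = F.nll_loss(log_probas.transpose(1, 2), targets, reduction='none')
# zero the padded positions, then average over the real tokens only
loss = (per_token * mask).sum() / mask.sum()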