I am trying to create ASR model by myself and learn how to use CTC loss.
I test and I see this:
ctc_loss = nn.CTCLoss(blank=95)
output: tensor([[63, 8, 1, 38, 29, 14, 41, 71, 14, 29, 45, 41, 3]]): torch.Size([1, 13]); output_size: tensor([13])
input1: torch.Size([167, 1, 96]); input1_size: tensor([167])
After applying the argmax
on this input (= prediction of phonems)
torch.argmax(input1, dim=2)
I get a series of symbols:
tensor([[63, 63, 63, 63, 63, 63, 95, 95, 63, 63, 95, 95, 8, 8, 8, 95, 8, 95,
8, 8, 95, 95, 95, 1, 1, 95, 1, 95, 1, 1, 95, 95, 38, 95, 95, 38,
38, 38, 38, 38, 29, 29, 29, 29, 29, 29, 29, 95, 29, 29, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 14, 95, 14, 95, 95, 95, 95, 14, 95, 14, 41, 41,
41, 95, 41, 41, 41, 41, 41, 41, 71, 71, 71, 95, 71, 71, 71, 71, 71, 95,
95, 14, 14, 95, 14, 14, 95, 14, 14, 95, 29, 29, 95, 29, 29, 29, 29, 29,
29, 29, 45, 95, 95, 45, 45, 95, 45, 45, 45, 45, 41, 95, 41, 41, 95, 95,
95, 41, 41, 41, 3, 3, 3, 3, 3, 95, 3, 3, 3, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95]])
and the following loss values.
ctc_loss(input1, output, input_size, output_size)
# Returns 222.8446
With a different input:
input2: torch.Size([167, 1, 96]) input2_size: tensor([167])
torch.argmax(input2, dim=2)
the prediction is just a sequence of blank symbols.
tensor([[95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
95, 95, 95, 95, 95]])
However, the loss value with the same desired output is much lower.
ctc_loss(input2, output, input_size, output_size)
# Returns 3.7955
I don't know why input1
is better than input2
but the loss of input1
is higher than input2
? Can someone explain that?