Greetings,
I am currently trying to develop a scene-text-recognition model with a CNN backbone (which might change to a pre-trained ResNet in the future) and an LSTM for the predictions.
Skipping all the unnecessary info: my targets have shape [16, 30] (batch size × max sequence length) and contain the labels (0-9 and a-z) encoded in the interval [1, 36], with 37 reserved for the END token, which I insert at the end of each sequence right before the padding starts. The rest of each target row, which does not contain any characters, is padded up to max_length (30) with 38s. Preds are the argmax indices, i.e. the result of:
```python
_, preds = torch.max(outputs, 2)
```
where outputs are the LSTM outputs, with shape [16, 30, 38]. I am using CTCLoss, so the padded part of each target is ignored automatically (the loss only looks at the first target_lengths entries of each target). The question is: how can I do the same for accuracy?
Additional info
- batch_size = 16
- max_seq_len = 30
- vocabulary_size = 38 (with END token and CTCLoss 'blank' character)
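For reference, here is a minimal sketch of how these shapes feed into nn.CTCLoss and how the padding is skipped via target_lengths. The blank index, the random tensors, and the concrete lengths below are placeholders, not my actual pipeline:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, max_seq_len, vocab_size = 16, 30, 38  # shapes from above

# LSTM outputs: [batch, time, classes]; CTCLoss wants [time, batch, classes]
outputs = torch.randn(batch_size, max_seq_len, vocab_size)
log_probs = outputs.log_softmax(2).permute(1, 0, 2)

# padded targets [batch, max_target_len]; labels in [1, 36], blank assumed at 0
targets = torch.randint(1, 37, (batch_size, max_seq_len))
target_lengths = torch.randint(5, 15, (batch_size,))   # true unpadded lengths
input_lengths = torch.full((batch_size,), max_seq_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
# only the first target_lengths[i] entries of row i are read; the 38-padding
# beyond that point never reaches the loss
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```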
What I have tried is to create an ignore-mask, as seen here (changed appropriately to work with PyTorch):
```python
import torch

class alt_accuracy:
    def __init__(self, pad_token=38):
        self.pad_token = pad_token

    def ignore_pad_accuracy(self, preds, targets):
        # 1 wherever the prediction is NOT the pad token, 0 elsewhere
        ignore_mask = (~torch.eq(preds, self.pad_token)).type(torch.IntTensor)
        # 1 wherever prediction == target AND the position is not masked out
        matches = torch.eq(targets, preds).type(torch.IntTensor) * ignore_mask
        if torch.sum(matches) == 0:
            return torch.tensor(0)
        return torch.sum(matches) / torch.sum(ignore_mask)
```
The problem is that this mask (if I understand it correctly) expects the model to predict the pads, which mine cannot do, since CTCLoss by design does not want blank/pad labels in the targets (as seen here). As a result, the ignore_mask is always a tensor filled with 1s from start to finish, i.e. it never finds any padding.
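To make the failure concrete: the model only outputs 38 classes, so preds always lies in [0, 37] and can never equal the pad value 38, which is why the mask is all ones. Building the mask from the targets instead (which do contain the pads) is one possible direction; the tensors below are random placeholders:

```python
import torch

torch.manual_seed(0)
pad_token = 38

outputs = torch.randn(16, 30, 38)   # [batch, max_seq_len, vocab_size]
_, preds = torch.max(outputs, 2)    # argmax indices lie in [0, 37]

# the prediction-based mask can never fire: preds never equals 38
ignore_mask = (~torch.eq(preds, pad_token)).int()
print(ignore_mask.sum().item())     # 480 == 16 * 30, i.e. all ones

# possible alternative: mask on the *targets*, which do contain the pads
targets = torch.randint(1, 37, (16, 30))
targets[:, 20:] = pad_token         # pretend the last 10 steps are padding
mask = targets != pad_token
accuracy = ((preds == targets) & mask).sum() / mask.sum()
```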
I would appreciate any help with the problem.