I am trying to perform sequence classification using a custom implementation of a transformer encoder layer. I have been following this tutorial pretty faithfully: tutorial.
The tutorial, however, does not demonstrate how to use this model to classify a whole sequence. After a bit of searching, I came up with the following training code:
```python
class Pred(TransformerPred):

    def _get_loss(self, batch, mode='train'):
        inp_data, labels = batch
        preds = self.forward(inp_data, pos_enc=True)
        preds = torch.mean(preds, dim=1)
        loss = F.cross_entropy(preds, labels[:, 0])
        acc = (preds.argmax(dim=-1) == labels[:, 0]).float().mean()
        return loss, acc

    def training_step(self, batch, batch_idx):
        loss, _ = self._get_loss(batch, mode='train')
        return loss
```
where

```python
inp_data.size()  # => torch.Size([4, 371, 1])
labels.size()    # => torch.Size([4, 2])
preds.size()     # => torch.Size([4, 371, 2])
```
Currently I am performing binary classification, so in this small example the batch size is 4, the sequence length is 371, and there are 2 classes. The labels are one-hot encoded: `[1, 0]` for class 0 and `[0, 1]` for class 1. My input has an embedding dimension of 1. I have read that `F.cross_entropy` is not necessarily the best choice for binary classification, but I plan to extend this to a few more classes, so I want it to stay generic.
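For reference, my understanding so far (from the docs, so this may be where I am going wrong) is that `F.cross_entropy` takes raw logits plus integer class indices and applies `log_softmax` internally; a one-hot label would need to be converted to an index first. A minimal sketch of what I mean, with made-up logits:

```python
import torch
import torch.nn.functional as F

# Raw logits for a batch of 4 samples and 2 classes (not probabilities).
logits = torch.tensor([[ 2.0, -1.0],
                       [ 0.5,  1.5],
                       [-0.3,  0.3],
                       [ 1.0,  0.2]])

# One-hot labels as in my setup: [1, 0] -> class 0, [0, 1] -> class 1.
one_hot = torch.tensor([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])

# Convert one-hot rows to integer class indices for F.cross_entropy.
targets = one_hot.argmax(dim=-1)  # tensor([0, 1, 1, 0])

# cross_entropy applies log_softmax to the logits internally.
loss = F.cross_entropy(logits, targets)
print(targets)        # tensor([0, 1, 1, 0])
print(loss.item() > 0)  # True
```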
My question is about the pooling step: since the encoder outputs one value per sequence position per class, I read that averaging those values along the sequence dimension can be useful when classifying the whole sequence.
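Concretely, the mean pooling I am doing collapses the sequence dimension so each sequence gets a single score vector (shapes match my example above):

```python
import torch

# Toy encoder output: batch 4, sequence length 371, 2 classes.
preds = torch.randn(4, 371, 2)

# Average over the sequence dimension (dim=1) to get one
# class-score vector per sequence.
pooled = preds.mean(dim=1)
print(pooled.shape)  # torch.Size([4, 2])
```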
What I observe during training, however, are values like:

```
tensor([[ 0.0863, -0.1591],
        [-0.1827, -0.4415],
        [-0.0477, -0.2966],
        [-0.1693, -0.4047]])
```

i.e. negative values, with class 0 always having the higher value. Is there something wrong with this approach? I am not sure I understand how `F.cross_entropy` works, or how I should use the transformer encoder to classify a whole sequence.
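To illustrate what I mean by "class 0 always having a higher value": my (possibly mistaken) understanding is that negative logits are not themselves a problem, since softmax rescales them, but in my outputs the argmax is always class 0:

```python
import torch
import torch.nn.functional as F

# The averaged logits I observe.
logits = torch.tensor([[ 0.0863, -0.1591],
                       [-0.1827, -0.4415],
                       [-0.0477, -0.2966],
                       [-0.1693, -0.4047]])

# Softmax maps logits (negative or not) to probabilities per row.
probs = F.softmax(logits, dim=-1)
print(probs.sum(dim=-1))      # each row sums to 1
print(probs.argmax(dim=-1))   # tensor([0, 0, 0, 0]) -- always class 0
```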