
I am trying to perform sequence classification using a custom implementation of a transformer encoder layer. I have been following this tutorial pretty faithfully: tutorial.

The tutorial, however, does not demonstrate an example of using this model to classify a whole sequence. After a little bit of searching, I have come up with the following training function:

import torch
import torch.nn.functional as F

# TransformerPred is the encoder model defined in the tutorial
class Pred(TransformerPred):
    def _get_loss(self, batch, mode='train'):
        inp_data, labels = batch
        preds = self.forward(inp_data, pos_enc=True)   # (batch, seq_len, num_classes)
        preds = torch.mean(preds, dim=1)               # average over the sequence dimension
        loss = F.cross_entropy(preds, labels[:, 0])    # target: first column of the one-hot labels
        acc = (preds.argmax(dim=-1) == labels[:, 0]).float().mean()
        return loss, acc

    def training_step(self, batch, batch_idx):
        loss, _ = self._get_loss(batch, mode='train')
        return loss

where

inp_data.size() => torch.Size([4, 371, 1])

labels.size() => torch.Size([4, 2])

preds.size() => torch.Size([4, 371, 2])

Currently I am performing binary classification, so in this small example the batch size is 4, the sequence length is 371 and there are 2 classes. The labels are one-hot encoded: [1, 0] for class 0 and [0, 1] for class 1. My input has an embedding dimension of 1. I have read that F.cross_entropy is not necessarily the best choice for binary classification, but I am planning to extend this to a few more classes, so I want it to stay generic.
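
For reference, my (possibly wrong) understanding is that F.cross_entropy expects raw logits of shape (batch, num_classes) and integer class indices as targets, so the one-hot labels would have to be converted, e.g. with argmax. A made-up example with the same shapes as my batch:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)               # (batch, num_classes), raw scores from the model
one_hot = torch.tensor([[1., 0.],        # class 0
                        [0., 1.],        # class 1
                        [1., 0.],        # class 0
                        [0., 1.]])       # class 1

targets = one_hot.argmax(dim=-1)         # tensor([0, 1, 0, 1]), integer class indices
loss = F.cross_entropy(logits, targets)  # applies log-softmax + NLL internally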

My question concerns the pooling step: since the encoder outputs one value per sequence position per class, I read that averaging those values over the sequence dimension can be useful for classifying the whole sequence, which is what I do above.

What I observe during training, however, are pooled outputs like tensor([[ 0.0863, -0.1591], [-0.1827, -0.4415], [-0.0477, -0.2966], [-0.1693, -0.4047]]), i.e. negative values, with class 0 always scoring higher. Is there something wrong with this approach? I am not sure I understand how F.cross_entropy works or how I should use the transformer encoder to classify a whole sequence.
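
To make the pooling step concrete, this is what I think the shapes look like (made-up numbers, only the shapes match my data):

import torch
import torch.nn.functional as F

preds = torch.randn(4, 371, 2)      # (batch, seq_len, num_classes): one score per position per class
pooled = preds.mean(dim=1)          # (4, 2): average the per-position scores over the sequence
probs = F.softmax(pooled, dim=-1)   # only for reporting probabilities at eval time
pred_class = pooled.argmax(dim=-1)  # (4,): predicted class per sequence (same with or without softmax)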

Desperados
  • `loss = F.cross_entropy(preds, labels[:, 0])` shouldn't this be `loss = F.cross_entropy(preds, labels[:, 1])` so that class 1 stays as class 1? – jhso May 04 '22 at 00:10
  • Also, negative outputs aren't bad. You just need to use a sigmoid or other activation. Your output class will just be `preds[:,1] > preds[:,0]`. For binary classification I would usually recommend having a single output neuron with a threshold applied. – jhso May 04 '22 at 00:12
  • @jhso I thought F.cross_entropy applied a softmax by default, so I did not add any activation at the end of my network. But now I am wondering again whether this makes sense, given that I average the values over the sequence length dimension. – Desperados May 04 '22 at 18:20
  • 1
    You're right, the loss function will have a softmax applied internally, but the outputs from your model when you call `preds = self.forward(...)` will be raw logits. So you need to apply softmax to them as the output of your eval function (your `_get_loss` function is fine as-is except for the change i mentioned above). – jhso May 04 '22 at 23:38
  • @jhso So applying a softmax before I average the logits? Would that not result in two softmax layers being applied (since cross_entropy will also do it internally)? Isn't that a problem? – Desperados May 05 '22 at 21:46
  • This is only about evaluation mode. You have two paths: in training you get the predictions, take the mean, and then calculate the loss. In evaluation, you get the predictions, take the mean, and then apply the softmax. If you're just doing `pred[:,1] > pred[:,0]` then softmax won't affect your accuracy metric. – jhso May 06 '22 at 00:41
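
Putting jhso's comments together, one possible reading of the suggested fix is the sketch below (untested; it assumes the TransformerPred base class from the tutorial and the one-hot labels described in the question, and uses argmax instead of labels[:, 1] so it also generalizes to more classes):

import torch
import torch.nn.functional as F

class Pred(TransformerPred):
    def _get_loss(self, batch, mode='train'):
        inp_data, labels = batch                       # labels are one-hot, shape (batch, num_classes)
        targets = labels.argmax(dim=-1)                # class indices, so class 1 stays class 1
        logits = self.forward(inp_data, pos_enc=True)  # (batch, seq_len, num_classes), raw logits
        logits = logits.mean(dim=1)                    # pool over the sequence dimension
        loss = F.cross_entropy(logits, targets)        # softmax is applied internally here
        acc = (logits.argmax(dim=-1) == targets).float().mean()
        return loss, acc

    def training_step(self, batch, batch_idx):
        loss, _ = self._get_loss(batch, mode='train')
        return loss

At evaluation time, F.softmax(logits, dim=-1) can be applied to the pooled logits to get probabilities; for the accuracy metric the argmax is the same with or without it.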

0 Answers