Can I use multiple softmax heads in the last output layer of a transformer? If so, how do I calculate the loss from that? I am working in PyTorch.
I am asking because my data is a sequence of tuples whose elements come from different vocabularies. For example:
[(2,1), (3,1), (3,1), (2,1), (2,1), (3,1), (3,0), (4,1)]
The first element of each tuple has a vocabulary of 5, and the second element has a vocabulary of 2.
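To make the question concrete, here is a minimal sketch of the kind of setup I have in mind (the names `head_first`, `head_second`, and the hidden size are just placeholders, and I am using `nn.CrossEntropyLoss`, which applies log-softmax internally, in place of explicit softmax layers):

```python
import torch
import torch.nn as nn

# Hypothetical setup: a shared transformer output feeds two separate
# linear heads, one per tuple element.
hidden_dim = 16
vocab_first, vocab_second = 5, 2  # vocab sizes of the two tuple elements

head_first = nn.Linear(hidden_dim, vocab_first)
head_second = nn.Linear(hidden_dim, vocab_second)

# nn.CrossEntropyLoss applies log-softmax internally, so no explicit
# softmax layer is placed before it.
criterion = nn.CrossEntropyLoss()

# Fake batch: 8 sequence positions of transformer hidden states,
# with targets taken from the example sequence above.
hidden = torch.randn(8, hidden_dim)
targets_first = torch.tensor([2, 3, 3, 2, 2, 3, 3, 4])   # first tuple elements
targets_second = torch.tensor([1, 1, 1, 1, 1, 1, 0, 1])  # second tuple elements

# One loss per head, summed into a single scalar for backprop.
loss = (criterion(head_first(hidden), targets_first)
        + criterion(head_second(hidden), targets_second))
loss.backward()
```

Is summing the two cross-entropy losses like this the right way to train such a model, or is there a better approach?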