
Why does passing a sequence of tokens, say ["A", "B", "C", "D"], through a masked language model without any masking not reproduce the same sequence when you select the highest-probability token at each position from the model's output logits, i.e., tokenizer.decode(softmax(logits).argmax(-1)) ≠ ["A", "B", "C", "D"]?
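
For concreteness, here is a minimal sketch of the round trip I mean, assuming the Hugging Face transformers API and a bert-base-uncased checkpoint (the checkpoint choice is just for illustration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "A B C D"  # stand-in for the unmasked input sequence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Highest-probability token at every position (softmax is monotonic,
# so argmax over logits gives the same ids as argmax over probabilities).
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.decode(predicted_ids[0]))
# The decoded string is often close to, but not guaranteed to equal, the input.
```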

Is this just a byproduct of the fact that the model may not have converged yet? I understand that when a model has just been initialized, the embeddings are random and have nothing to do with the semantics of the vocabulary, but after a few epochs I would expect the model to be able to reproduce the unmasked input sequence perfectly.

Anshul
