Why does passing a sequence of tokens, say ["A", "B", "C", "D"], through a masked language model without any masking not reproduce that same sequence when you select the highest-probability token at each position from the output logits, i.e. tokenizer.decode(softmax(logits, dim=-1).argmax(dim=-1)) ≠ ["A", "B", "C", "D"]?
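
For concreteness, here is a minimal sketch of the setup I have in mind, assuming a pretrained BERT checkpoint loaded through the Hugging Face transformers library (the model name and input text are just placeholders for whatever checkpoint and data are actually used):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    # Feed the sequence through the model with no [MASK] tokens anywhere.
    inputs = tokenizer("A B C D", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Pick the highest-probability token at every position and decode.
    pred_ids = logits.argmax(dim=-1)[0]
    print(tokenizer.decode(pred_ids))
    # Naively I would expect this to echo the input (plus special tokens),
    # but it does not match exactly.
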
Is this just a byproduct of the model not having converged yet? I understand that when a model has just been initialized the embeddings are random and have nothing to do with the semantics of the vocabulary, but after a few epochs I would expect the model to reconstruct the unmasked input sequence perfectly.