When Transformers are trained with masked language modeling (or masked image modeling), the input embeddings at masked positions are replaced with a [MASK] token, i.e. a learnable mask embedding. I'm wondering how these mask embeddings work: how does the model learn that, when it sees that embedding, it should 'fill in' the representation from the surrounding context, rather than (for example) predicting the average embedding at every masked position and collapsing to a single mode? Would this still work if you masked with zeros instead of the mask token during training?
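For concreteness, this is roughly the setup I mean: a learnable mask embedding swapped in at randomly chosen positions before the encoder. It's a minimal PyTorch-style sketch, not any particular library's implementation, and the class/parameter names (`MaskedEmbedder`, `mask_prob`, etc.) are just placeholders I made up:

```python
import torch
import torch.nn as nn

class MaskedEmbedder(nn.Module):
    """Embeds tokens and replaces a random subset with a learnable mask embedding."""
    def __init__(self, vocab_size, d_model, mask_prob=0.15):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # the learnable mask embedding in question, trained jointly with the rest
        self.mask_embedding = nn.Parameter(torch.zeros(d_model))
        self.mask_prob = mask_prob

    def forward(self, token_ids):
        x = self.embed(token_ids)                    # (batch, seq, d_model)
        masked = torch.rand(token_ids.shape, device=x.device) < self.mask_prob
        # swap in the mask embedding at masked positions; the prediction /
        # reconstruction loss is then computed only at these positions
        x = torch.where(masked.unsqueeze(-1), self.mask_embedding, x)
        return x, masked

# usage: the masked embeddings are what the Transformer encoder actually sees
embedder = MaskedEmbedder(vocab_size=30000, d_model=768)
embeddings, masked = embedder(torch.randint(0, 30000, (8, 128)))
```

The zero-masking variant I'm asking about would just replace `self.mask_embedding` with a fixed all-zeros vector.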
Additional question: Could you mask out part of an input embedding (e.g., set ⅔ of its features to zero) during training and still learn to estimate a representation of the full input from the partial input and its context?
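To make this concrete, here's the kind of feature-level masking I have in mind (a sketch; the function name and the ⅓ keep fraction are arbitrary). It's essentially unscaled dropout on the input embeddings, but with the training target being the full, unmasked embedding:

```python
import torch

def mask_features(x, keep_frac=1 / 3):
    """Zero out a random (1 - keep_frac) fraction of the features of each embedding."""
    keep = (torch.rand_like(x) < keep_frac).float()
    # the encoder would see x * keep; the loss would target the original x
    return x * keep, keep
```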