When Transformers are trained with masked language modeling (or masked image modeling), the input embeddings at masked positions are replaced with a [MASK] token, i.e. a learnable mask embedding. I'm wondering how these mask embeddings work: how does the model learn that, when it sees that embedding, it should 'fill in' the representation from the surrounding context, rather than (for example) predicting the average embedding at every masked position and collapsing to a single mode? Would this still work if you masked with zeros instead of the mask token during training?
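For concreteness, this is roughly the setup I mean: a learnable mask embedding swapped in at randomly chosen positions before the encoder. It's a minimal PyTorch-style sketch, not any particular library's implementation, and the class/parameter names (`MaskedEmbedder`, `mask_prob`, etc.) are just placeholders I made up:

```python
import torch
import torch.nn as nn

class MaskedEmbedder(nn.Module):
    """Embeds tokens and replaces a random subset with a learnable mask embedding."""
    def __init__(self, vocab_size, d_model, mask_prob=0.15):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # the learnable mask embedding in question, trained jointly with the rest
        self.mask_embedding = nn.Parameter(torch.zeros(d_model))
        self.mask_prob = mask_prob

    def forward(self, token_ids):
        x = self.embed(token_ids)                    # (batch, seq, d_model)
        masked = torch.rand(token_ids.shape, device=x.device) < self.mask_prob
        # swap in the mask embedding at masked positions; the prediction /
        # reconstruction loss is then computed only at these positions
        x = torch.where(masked.unsqueeze(-1), self.mask_embedding, x)
        return x, masked

# usage: the masked embeddings are what the Transformer encoder actually sees
embedder = MaskedEmbedder(vocab_size=30000, d_model=768)
embeddings, masked = embedder(torch.randint(0, 30000, (8, 128)))
```

The zero-masking variant I'm asking about would just replace `self.mask_embedding` with a fixed all-zeros vector.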
Additional question: Could you mask out part of an input embedding (e.g., set ⅔ of its features to zero) during training and still learn to estimate a representation of the full input from the partial input and its context?
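To make this concrete, here's the kind of feature-level masking I have in mind (a sketch; the function name and the ⅓ keep fraction are arbitrary). It's essentially unscaled dropout on the input embeddings, but with the training target being the full, unmasked embedding:

```python
import torch

def mask_features(x, keep_frac=1 / 3):
    """Zero out a random (1 - keep_frac) fraction of the features of each embedding."""
    keep = (torch.rand_like(x) < keep_frac).float()
    # the encoder would see x * keep; the loss would target the original x
    return x * keep, keep
```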