How can BERT's [CLS] token collect the relevant information from the rest of the hidden states? Does [CLS] carry MLM information? If I train my BERT using only MLM, does [CLS] still work in that case?
1 Answer
Self-attention takes care of that: what self-attention does is essentially a clever collection and combination of information from the hidden states at the previous network layer.
Nevertheless, something needs to "tell" the [CLS] vector to collect the information about the rest of the sentence. When only the masked-language-model objective is used, the [CLS] vector does not play any special role. However, as the RoBERTa paper shows, it does not need special pretraining: using the vector during fine-tuning provides enough training signal.

Jindřich