
How can BERT's [CLS] token collect the relevant information from the rest of the hidden states? Does [CLS] carry MLM information? If I train my BERT using only MLM, does [CLS] still work in that case?

kowser66

1 Answer


Self-attention takes care of that: what self-attention does is essentially a clever collection and combination of information from the hidden states at the previous network layer, so the [CLS] position can read from every other position at every layer.
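To make that concrete, here is a minimal single-head sketch (PyTorch, toy dimensions rather than BERT's actual configuration) of how the output at the [CLS] position ends up being a weighted combination of all positions:

```python
import torch
import torch.nn.functional as F

hidden_size = 8                      # toy dimension (BERT-base uses 768)
seq_len = 5                          # [CLS] + 4 "word" positions
hidden_states = torch.randn(seq_len, hidden_size)  # output of the previous layer

# Learned projections (random here; learned during pre-training / fine-tuning)
W_q = torch.randn(hidden_size, hidden_size)
W_k = torch.randn(hidden_size, hidden_size)
W_v = torch.randn(hidden_size, hidden_size)

Q = hidden_states @ W_q
K = hidden_states @ W_k
V = hidden_states @ W_v

# Attention scores of the [CLS] query (position 0) against every position
scores = Q[0] @ K.T / hidden_size ** 0.5   # shape: (seq_len,)
weights = F.softmax(scores, dim=-1)        # how much [CLS] "reads" each token

# New [CLS] representation: a weighted combination of all value vectors
cls_next = weights @ V                     # shape: (hidden_size,)
print(weights)   # non-zero weight on every position -> information flows into [CLS]
```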

Nevertheless, something needs to "tell" the [CLS] vector to collect information about the rest of the sentence. When using only the masked-language-model objective, the [CLS] vector does not play any special role. However, as the RoBERTa paper shows, it does not need any special pre-training: using the vector during fine-tuning provides enough training signal.
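A rough sketch of that fine-tuning setup, assuming the Hugging Face transformers library (the linear head and the binary task are illustrative, not the exact recipe from the papers): the final hidden state at the [CLS] position is fed into a small classification head, and gradients flowing back through it teach [CLS] to summarize the sentence.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # e.g. a binary task

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = bert(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]  # hidden state at the [CLS] position
logits = classifier(cls_vector)               # fine-tuning loss is computed on this
```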

Jindřich