I'm confused about how cross-entropy works in BERT's masked language model. To calculate the loss we need the ground-truth labels for the masked tokens, but we don't have a vector representation of those truth labels, while the model's predictions are vectors. So how is the loss calculated?
kowser66
This is not how BERT works, and you are asking on the wrong site; this is not a machine learning site. – Dr. Snoopy Jun 16 '22 at 07:05
1 Answer
We already know which words we mask before passing the input to BERT, so the masked word's one-hot encoding over the vocabulary is the ground-truth label. The hidden vector BERT produces at each masked position is passed through an output layer and a softmax, which turns it into a probability distribution over the vocabulary (its size is the vocabulary size). The cross-entropy loss is then computed between this predicted distribution and the one-hot truth label. Hope this clarifies. For a better explanation, watch this: https://www.youtube.com/watch?v=xI0HHN5XKDo
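Here is a minimal sketch of that loss computation in PyTorch. The sizes, the `mlm_head` projection, and the token ids below are illustrative placeholders, not BERT's actual weights or vocabulary:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (30522 is BERT-base's WordPiece vocabulary size).
vocab_size = 30522
hidden_size = 768
num_masked = 2          # number of [MASK] positions in this toy example

# Stand-ins for the final hidden states BERT produces at the masked positions.
hidden_states = torch.randn(num_masked, hidden_size)

# The MLM head projects each hidden state to one score per vocabulary entry.
mlm_head = torch.nn.Linear(hidden_size, vocab_size)
logits = mlm_head(hidden_states)             # shape: (num_masked, vocab_size)

# The ground-truth labels are just the vocabulary indices of the original
# (pre-masking) tokens; the one-hot vectors are implicit in these indices.
true_token_ids = torch.tensor([2054, 7592])  # hypothetical token ids

# cross_entropy applies the softmax internally and compares the resulting
# distribution against the one-hot target at each masked position.
loss = F.cross_entropy(logits, true_token_ids)
print(loss.item())
```

For reference, Hugging Face's `BertForMaskedLM` does this internally when you pass `labels`, with the non-masked positions set to -100 so they are ignored by the loss.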

prtkp