
Transformers (e.g. BERT) use a set of three matrices Q, K, V for each attention head. BERT uses 12 attention heads in each layer, and each attention head has its own set of these three matrices.

The actual values of these 36 matrices (per layer) are learned during training.
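
For concreteness, here is a minimal sketch (plain PyTorch, with my own variable names and BERT-base-like sizes) of what I understand the per-head parameters to be: each head gets its own, independently initialized Q, K, V projections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads = 768, 12           # BERT-base-like sizes
d_head = d_model // n_heads          # 64

# One independent set of Q, K, V projections per head; only the random
# initialization distinguishes head i from head j at the start of training.
W_q = [nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)]
W_k = [nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)]
W_v = [nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)]

x = torch.randn(5, d_model)          # 5 tokens, hidden size 768

def head_output(i):
    # Scaled dot-product attention for head i.
    q, k, v = W_q[i](x), W_k[i](x), W_v[i](x)
    attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    return attn @ v

# Different initializations => different outputs, even for identical input.
print(torch.allclose(head_output(0), head_output(1)))   # False
```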

My question is: how does the model ensure that it doesn't end up with 12 sets of identical matrices? The only real difference between the attention heads seems to be their random initialization. It is quite possible, of course, that random initialization combined with gradient descent training does keep the attention heads from becoming identical. But this is not guaranteed, right?
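
To make the concern concrete, here is a toy check (my own simplified setup, which sums two heads and ignores the output projection): if two heads start with identical weights and the loss treats them symmetrically, they receive identical gradients and therefore stay identical under gradient descent. Random initialization seems to be the only thing breaking that symmetry.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_head = 16, 4
x = torch.randn(3, d_model)

def make_head():
    return nn.ModuleDict({
        "q": nn.Linear(d_model, d_head, bias=False),
        "k": nn.Linear(d_model, d_head, bias=False),
        "v": nn.Linear(d_model, d_head, bias=False),
    })

head_a, head_b = make_head(), make_head()
# Force the two heads to start with *identical* weights.
head_b.load_state_dict(head_a.state_dict())

def run(head):
    q, k, v = head["q"](x), head["k"](x), head["v"](x)
    attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    return attn @ v

# A toy loss that treats both heads symmetrically (outputs summed, not concatenated).
loss = (run(head_a) + run(head_b)).pow(2).sum()
loss.backward()

# Identical weights + symmetric loss => identical gradients, so the heads
# would remain identical forever if they started out that way.
print(torch.allclose(head_a["q"].weight.grad, head_b["q"].weight.grad))   # True
```

(In a real Transformer the head outputs are concatenated and multiplied by an output projection, whose per-head slices are also initialized differently, so the symmetry above is broken in practice. But as far as I can tell that still comes down to random initialization.)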

Is there any other factor ensuring that the attention heads don't become identical over the course of training?

This part is not clear from any of the papers I have seen online.

