
I am trying to figure out how to do backpropagation through the scaled dot-product attention model. Scaled dot-product attention takes Q (queries), K (keys), and V (values) as inputs and performs the following operation:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

Here √dk is the scaling factor, which is a constant (dk is the dimension of the keys).
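For concreteness, here is a minimal NumPy sketch of the forward pass described by the formula above; the function and variable names (softmax, scaled_dot_product_attention, the 4×8 shapes) are my own choices for illustration, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j] is the (scaled) dot product of query i with key j.
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    weights = softmax(scores, axis=-1)  # softmax over the keys
    return weights @ V                  # weighted sum of the value rows

# Tiny usage example with Q = K = V, as assumed in the question.
Q = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, Q, Q)
print(out.shape)  # (4, 8)
```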

Here Q, K, and V are tensors. For now I am assuming that Q = K = V, so I differentiate softmax(QQᵀ)Q with respect to Q (ignoring the constant scaling factor). I think the answer would be:

softmax(QQᵀ) + Q · derivativeOfSoftmax(QQᵀ) · (2Qᵀ)

This is because I think the derivative of QQᵀ with respect to Q is 2QQᵀ.

Is this the right approach considering the rules of tensor calculus? If not, kindly tell me how to proceed.
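One way to check whatever closed-form gradient you end up with (not something from the question itself, just a sketch) is to compare it against a finite-difference estimate of the gradient of a scalar function of the output, here the sum of all entries, with Q = K = V as assumed above:

```python
import numpy as np

def attention_sum(Q):
    # Sum of all entries of softmax(QQᵀ/√dk) Q, i.e. attention with Q = K = V.
    dk = Q.shape[-1]
    scores = Q @ Q.T / np.sqrt(dk)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return (weights @ Q).sum()

Q = np.random.randn(4, 8)
eps = 1e-6
grad_num = np.zeros_like(Q)
for i in range(Q.shape[0]):
    for j in range(Q.shape[1]):
        dQ = np.zeros_like(Q)
        dQ[i, j] = eps
        # Central difference approximates d(attention_sum)/dQ[i, j].
        grad_num[i, j] = (attention_sum(Q + dQ) - attention_sum(Q - dQ)) / (2 * eps)

# grad_num can now be compared entry-wise with any hand-derived expression.
```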

One can refer to the concept of scaled dot-product attention in this paper: https://arxiv.org/pdf/1706.03762.pdf

cherry13

1 Answer


I'm not sure tensor calculus is the right term.

Choose a specific index of your vector, say index j, and differentiate with respect to that variable. Do that for index 1, 2, 3, and so on, and you will see a pattern. Let me give an example with multiplication. There are two types of multiplication with matrices: matrix multiplication and the Hadamard product. The Hadamard product is the intuitive method, where you multiply two matrices of the same dimensions element-wise. In a similar manner, you should differentiate your softmax function "element-wise".
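To make that concrete, here is a small sketch (my own names and shapes, not from the answer) of what differentiating the softmax "element-wise" looks like: for a single row of scores with softmax output p, the Jacobian entries follow the pattern J[i, j] = p[i]·(1 if i == j else 0) − p[i]·p[j], and perturbing one input index j at a time reproduces column j of that Jacobian.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def softmax_jacobian(s):
    # Jacobian of softmax with respect to its input scores s:
    # J[i, j] = p[i] * (1 if i == j else 0) - p[i] * p[j], with p = softmax(s).
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)

s = np.random.randn(5)
J = softmax_jacobian(s)

# "Element-wise" check: perturb one input index j at a time and watch how every
# output index i responds; this recovers column j of the Jacobian.
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(s)):
    d = np.zeros_like(s)
    d[j] = eps
    J_num[:, j] = (softmax(s + d) - softmax(s - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True
```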

Rehaan Ahmad