I am trying to figure out how to do backpropagation through the scaled dot-product attention model. Scaled dot-product attention takes Q (queries), K (keys), and V (values) as inputs and performs the following operation:
Attention(Q, K, V) = softmax(Q · Kᵀ / √dk) · V
Here √dk is a constant scaling factor, and Q, K, and V are tensors.
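For concreteness, here is a minimal NumPy sketch of this forward pass as I understand it (the function and variable names, shapes, and the single-head setup are just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # shape (n_q, n_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # shape (n_q, d_v)

# Self-attention case Q = K = V with 4 positions and d_k = 8
Q = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, Q, Q)
print(out.shape)  # (4, 8)
```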
For now I am assuming that Q = K = V and ignoring the constant scaling factor, so I differentiate softmax(Q · Qᵀ) · Q with respect to Q. I think the answer would be:
softmax(Q · Qᵀ) + Q · derivativeOfSoftmax(Q · Qᵀ) · (2 · Qᵀ)
This is because I think the derivative of Q · Qᵀ with respect to Q is 2 · Q · Qᵀ.
Is this the right approach, considering the rules of tensor calculus? If not, kindly tell me how to proceed.
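To sanity-check whatever expression the derivation produces, I assume one could compare it against automatic differentiation. Below is a minimal PyTorch sketch with Q = K = V (the helper name `attention_self` and the shapes are just illustrative):

```python
import torch

def attention_self(Q):
    # softmax(Q Qᵀ / √d_k) Q, i.e. the formula above with Q = K = V.
    d_k = Q.shape[-1]
    scores = Q @ Q.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ Q

# Autograd gives a reference gradient for any scalar loss built on the output,
# which a hand-derived expression can be compared against.
Q = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
attention_self(Q).sum().backward()
print(Q.grad.shape)  # (4, 8): gradient of sum(output) w.r.t. Q

# gradcheck compares autograd's Jacobian with finite differences (needs float64).
Q2 = torch.randn(3, 5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(attention_self, (Q2,)))  # True if they agree
```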
One can refer to the concept of scaled dot-product attention in this paper: https://arxiv.org/pdf/1706.03762.pdf