Lets say I want to manually calculate the gradient update with respect to the Kullback-Liebler divergence loss, say on a VAE (see an actual example from pytorch sample documentation here):
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
where the logvar is (for simplicitys sake, ignoring activation functions and multiple layers etc.) basically a single layer transformation from a 400 dim feature vector into a 20 dim one:
self.fc21 = nn.Linear(400, 20)
logvar = fc21(x)
I'm just not mathematically understanding how you take the gradient of this, with respect to the weight vector for fc21. Mathematically I thought this would look like:
KL = -.5sum(1 + Wx + b - m^2 - e^{Wx + b})
dKL/dW = -.5 (x - e^{Wx + b}x)
where W is the weight matrix of the fc21 layer. But here this result isn't in the same shape as W (20x400). Like, x is just a 400 feature vector. So how would I perform SGD on this? Does x just broadcast to the second term, and if so why? I feel like I'm just missing some mathematical understanding here...