
Let's say I want to manually calculate the gradient update with respect to the Kullback-Leibler divergence loss, say on a VAE (see an actual example from the PyTorch sample documentation here):

KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

where logvar is (for simplicity's sake, ignoring activation functions, multiple layers, etc.) basically a single-layer transformation from a 400-dimensional feature vector into a 20-dimensional one:

self.fc21 = nn.Linear(400, 20)
logvar = self.fc21(x)
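For concreteness, here is a minimal standalone shape check (fc21 is a local variable here rather than a module attribute, since we're outside the VAE class):

```python
import torch
import torch.nn as nn

fc21 = nn.Linear(400, 20)  # weight has shape (20, 400), bias has shape (20,)
x = torch.randn(400)       # a single 400-dim feature vector
logvar = fc21(x)           # shape (20,)
print(fc21.weight.shape)   # torch.Size([20, 400])
print(logvar.shape)        # torch.Size([20])
```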

I'm just not mathematically understanding how you take the gradient of this with respect to the weight matrix of fc21. Mathematically I thought this would look like:

KL = -0.5 * sum(1 + Wx + b - m^2 - e^(Wx + b))

dKL/dW = -0.5 * (x - x * e^(Wx + b))

where W is the weight matrix of the fc21 layer. But this result isn't the same shape as W (20x400) — x is just a 400-dimensional feature vector. So how would I perform SGD on this? Does x just broadcast to the second term, and if so, why? I feel like I'm missing some mathematical understanding here...

Matt

1 Answer


Let's simplify the example a bit and assume a fully connected layer of input shape 3 and output shape 2, then:

W = [[w1, w2, w3], [w4, w5, w6]]
b = [b1, b2]
x = [x1, x2, x3]
y = [w1*x1 + w2*x2 + w3*x3 + b1, w4*x1 + w5*x2 + w6*x3 + b2]
D_KL = -0.5 * [(1 + y1 - m1^2 - e^(y1)) + (1 + y2 - m2^2 - e^(y2))]
grad(D_KL, w1) = -0.5 * [x1 - x1 * e^(y1)]
grad(D_KL, w2) = -0.5 * [x2 - x2 * e^(y1)]
...
grad(D_KL, w4) = -0.5 * [x1 - x1 * e^(y2)]
...
grad(D_KL, W) = [[grad(D_KL, w1), grad(D_KL, w2), grad(D_KL, w3)],
                 [grad(D_KL, w4), grad(D_KL, w5), grad(D_KL, w6)]]

This generalizes to tensors of any rank. Your differentiation went wrong by treating x and W as scalars rather than taking element-wise partial derivatives: each entry of W gets its own partial derivative, and collecting them produces a gradient with exactly the same shape as W — no broadcasting mystery involved.
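To double-check the sign and shape, here is a quick sketch with made-up small dimensions (3 inputs, 2 outputs), comparing autograd's gradient against the manual formula dKLD/dW[j, i] = -0.5 * x[i] * (1 - e^(y_j)), which is just an outer product:

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 3, requires_grad=True)  # toy layer: 3 inputs -> 2 outputs
b = torch.randn(2, requires_grad=True)
x = torch.randn(3)
mu = torch.randn(2)

logvar = W @ x + b
kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
kld.backward()

# manual gradient: dKLD/dW[j, i] = -0.5 * x[i] * (1 - exp(logvar[j])),
# an outer product with the same (2, 3) shape as W
manual = -0.5 * torch.outer(1 - logvar.detach().exp(), x)
print(torch.allclose(W.grad, manual))  # True
```

The outer product makes explicit why the gradient ends up with W's shape: each output component y_j contributes a scaled copy of x as row j of the gradient.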

KonstantinosKokos