
I am computing the different terms of the ELBO and their expectations to illustrate and get a better grasp of the reparametrization trick, as nicely explained here under "undifferentiable expectations".

As a simplified example in this journey, I have a random variable $z$ distributed as follows

$$z \sim \mathcal{N}(\mu=0, \sigma=t)$$

where $t$ is a parameter that I want to optimize (as in MLE). I take one sample $z_i$ from the distribution and I want to compute the gradient of the probability of $z_i$ with respect to $t$, at some given value $t = t_{\text{value}}$:

$$\nabla_t p_z(z_i)$$
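
For reference, the pdf and its analytic gradient with respect to $t$ (which the sympy code below reproduces symbolically) are

$$p_z(z_i) = \frac{1}{t\sqrt{2\pi}}\, e^{-z_i^2/(2t^2)}, \qquad \nabla_t\, p_z(z_i) = p_z(z_i)\left(\frac{z_i^2}{t^3} - \frac{1}{t}\right).$$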

I use a gradient tape for this and a tfd.Normal object to compute the pdf. I must build the distribution object within the gradient tape so that I can optimize with respect to $t$, upon which the distribution depends. Therefore I must also sample within the gradient tape.

Reparametrization trick aside, when I sample from the same tfd distribution object I get different gradients compared to when I sample from an equivalent scipy.stats object (which is therefore not connected to the computational graph).

In fact, the gradients computed in the second case (using scipy.stats) correspond to the ones computed with sympy differentiation. And there is definitely a relation between the gradients obtained.

Note that I am just computing the gradients over a single data item each time, not computing expectations.

Clearly, there is some extra dependency introduced in the computational graph by sampling and this affects the gradients. Is this something expected, or not, or am I just doing something too weird?
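
As a minimal check of that extra dependency (just a sketch, separate from the figures below), I can also ask the tape for the gradient of the sample itself with respect to $t$; as far as I understand, this should be non-None only when the sample is drawn from the tfd object inside the tape:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

tvar = tf.Variable(1.5)
with tf.GradientTape(persistent=True) as tape:  # persistent: two gradient calls below
    dz = tfd.Normal(loc=0., scale=tvar)
    sample = dz.sample(1)      # drawn inside the tape, from the tfd object
    loss = dz.prob(sample)

# if this is not None, the sample itself is connected to tvar in the graph
# and therefore contributes to the gradient of the loss
print(tape.gradient(sample, tvar))
print(tape.gradient(loss, tvar))

The full code producing the comparison plots follows.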

import sympy as sy
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions

def sample_gradients(sample_from_tf=True, tvalue=1.5):
    """Compute 100 gradients of p(z_i) w.r.t. t, one sample z_i at a time."""
    tvar = tf.Variable(tvalue)

    grads, samples = [], []

    # repeat 100 times to plot the gradients later
    for _ in range(100):
        with tf.GradientTape() as tape:
            # build the distribution inside the tape so it depends on tvar
            dz = tfd.Normal(loc=0., scale=tvar)
            if sample_from_tf:
                # sample from the tfd object itself (inside the tape)
                sample = dz.sample(1)
            else:
                # sample from an equivalent scipy distribution, detached from the graph;
                # cast to float32 to match the dtype of the tfd distribution
                sample = stats.norm(loc=0, scale=tvar.numpy()).rvs(1).astype(np.float32)

            loss = dz.prob(sample)

        grad = tape.gradient(loss, tvar)
        grads.append(grad.numpy())
        samples.append(float(sample[0]))

    return grads, samples

# prepare for computing grads of pdf with sympy
t,z = sy.symbols(r't z')
ptz = 1/(t*sy.sqrt(2*sy.pi))*sy.exp(-(z/t)**2/2) # gaussian pdf in sympy
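# quick sanity check: the symbolic derivative should equal p(z,t) * (z**2/t**3 - 1/t),
# i.e. the analytic gradient given above
print(sy.simplify(ptz.diff(t) - ptz*(z**2/t**3 - 1/t)))   # expect 0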

# a chosen value for t
tvalue = 1.5

# compute gradients sampling from the same tfd distribution 
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=True, tvalue=tvalue)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]

plt.figure()
plt.scatter(sgrads, grads)
plt.title(r"sampling from the same tfd object $\rightarrow$ DIFFERENT");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")

# compute gradients sampling from the equivalent distribution in scipy.stats
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=False, tvalue=tvalue)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]

plt.figure()
plt.scatter(sgrads, grads)
plt.title(r"sampling from scipy.stats $\rightarrow$ EQUAL");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")