
I am trying to use Hamiltonian Monte Carlo (HMC, from TensorFlow Probability), but my target distribution contains an intractable 1-D integral which I approximate with the trapezoidal rule. My understanding of HMC is that it calculates gradients of the target distribution to build a more efficient transition kernel. My question is: can TensorFlow work out gradients with respect to the parameters of the function, and are they meaningful?

For example, this is the log-probability of the target distribution, where 'A' is a model parameter:

import tensorflow as tf
from tensorflow import math as tfm

PI = 3.141592653589793

def log_prob(A, observed_data, variance):
    # integrate e^(A*t) * f(t) with respect to t between 0 and t, for all t
    t = tf.linspace(0., 10., 100)
    f = tf.ones(100)
    delta = t[1]-t[0]
    sum_term = tfm.multiply(tfm.exp(A*t), f)
    # cumulative trapezoidal rule: 0.5*delta*(y[i] + y[i+1]), accumulated over the grid
    integrals = 0.5*delta*tfm.cumsum(sum_term[:-1] + sum_term[1:], axis=0)
    pred = integrals
    sq_diff = tfm.square(observed_data - pred)
    sq_diff = tf.reduce_sum(sq_diff, axis=0)
    log_lik = -0.5*tfm.log(2*PI*variance) - 0.5*sq_diff/variance
    return log_lik

Are the gradients of this function with respect to A meaningful?
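
For reference, here is a minimal sketch (not from the original question) of how a log-probability like the one above plugs into TFP's HMC sampler; the step size, sample counts, and initial state below are illustrative values, and `observed_data` and `variance` are assumed to be defined as above:

import tensorflow_probability as tfp

# close the log-prob over the data so HMC sees a function of A alone
target_fn = lambda A: log_prob(A, observed_data, variance)

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_fn,
    step_size=0.01,           # illustrative value
    num_leapfrog_steps=10)    # illustrative value

samples = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=tf.constant(0.5),  # illustrative starting value for A
    kernel=kernel,
    trace_fn=None)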

Cobbles
  • It's a bit of a strange question; do you mean whether you can compute gradients? Sure, as long as all operations are differentiable (which in your example they should be). You can try that yourself. But which gradients, with respect to what, do you want to compute? How "meaningful" they are depends on that, on what you do with the computed gradient, and on your goals. – jdehesa May 26 '20 at 09:58
  • What are your variables, with respect to which you take the gradients? – Aramakus May 26 '20 at 09:59
  • @jdehesa I've updated the question with more information. The goal is to perform Hamiltonian Monte Carlo, which uses gradients. The code I posted is part of the target probability. I'd like to know if the gradients are actually meaningful when an approximation is used -- my understanding of how automatic differentiation works is limited. – Cobbles May 26 '20 at 10:41
  • Agreed with jdehesa, all that matters is the ops involved are supported by TensorFlow's autodifferentiation - and it has lots of such ops, likely sufficient for your needs. It's then a question of using those ops to build the forward pass that yields the correct backward pass. This can be complicated by a need to tell TensorFlow to explicitly "watch" certain tensors it won't automatically (`GradientTape().watch()`), or to tell it to NOT differentiate via `tf.stop_gradient`. -- As for RNG gradients, unsure, but TF does train variational autoencoders, which include WGN in forward pass. – OverLordGoldDragon May 26 '20 at 10:42
  • @Aramakus I take gradients with respect to the parameters, which in the example consist of just 'A' – Cobbles May 27 '20 at 09:33

1 Answer

Yes, you can use TensorFlow's GradientTape to work out the gradients. I assume you have a mathematical function that outputs log_lik from many inputs, one of which is A.

Using GradientTape to get the gradient with respect to A

To get the gradient of log_lik with respect to A, you can use tf.GradientTape in TensorFlow.

For example:

# placeholder values just to make the sketch runnable; use your real model inputs
A = tf.constant(1.0)
observed_data = tf.random.normal([99])  # same shape as `integrals` below
variance = tf.constant(1.0)

with tf.GradientTape(persistent=True) as g:
  g.watch(A)

  t = tf.linspace(0., 10., 100)
  f = tf.ones(100)
  delta = t[1]-t[0]
  sum_term = tfm.multiply(tfm.exp(A*t), f)
  integrals = 0.5*delta*tfm.cumsum(sum_term[:-1] + sum_term[1:], axis=0)
  pred = integrals
  sq_diff = tfm.square(observed_data - pred)
  sq_diff = tf.reduce_sum(sq_diff, axis=0)
  log_lik = -0.5*tfm.log(2*PI*variance) - 0.5*sq_diff/variance
  z = log_lik

# then, you can get the gradient of log_lik with respect to A like this
dz_dA = g.gradient(z, A)

dz_dA holds the partial derivatives of z with respect to each variable in A.

The code above just shows the idea. To make it work, you need to perform the whole calculation with tensor operations, so modify your function to use tensor types throughout.
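
To illustrate that point (a minimal sketch of my own, not from the original answer): an operation that leaves TensorFlow, such as a NumPy call, is not recorded on the tape and kills the gradient, while the equivalent tensor op keeps it:

import numpy as np
import tensorflow as tf

x = tf.constant(2.0)
with tf.GradientTape() as g:
  g.watch(x)
  y = tf.exp(x)              # tensor op: recorded on the tape
  # y = np.exp(x.numpy())    # NumPy op: not recorded; gradient would be None

print(g.gradient(y, x))      # tf.Tensor(7.389056...), i.e. e^2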

Another example, entirely in tensor operations, computing a second-order gradient with nested tapes:

x = tf.constant(3.0)
with tf.GradientTape() as g:
  g.watch(x)
  with tf.GradientTape() as gg:
    gg.watch(x)
    y = x * x
  dy_dx = gg.gradient(y, x)     # Will compute to 6.0
d2y_dx2 = g.gradient(dy_dx, x)  # Will compute to 2.0

You can find more examples in the documentation: https://www.tensorflow.org/api_docs/python/tf/GradientTape

Further discussion on "meaningfulness"

Let me translate the Python code into mathematics first:

# integrate e^At * f[t] with respect to t between 0 and t, for all t

From the above, you have a function; I call it $g(t, A) = e^{At} f(t)$.

Then you take a definite integral of it; I call it $G(t, A) = \int_0^t e^{As} f(s)\,ds$.

From your code, t is not a variable any more: it runs over a fixed grid (up to 10). So each approximated integral reduces to a function of a single variable, $h_i(A)$, one per grid point.

Up to here, $h$ has a definite integral inside. But since you are approximating it, we should not think of it as a true integral (the limit $dt \to 0$); it is just another chain of simple maths. No mystery here.
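
Concretely (my notation, with $\Delta$ the grid spacing from the code), the full-interval trapezoidal sum and its term-by-term derivative are:

$$h(A) \approx \frac{\Delta}{2} \sum_i \left( e^{A t_i} f(t_i) + e^{A t_{i+1}} f(t_{i+1}) \right)$$

$$\frac{\partial h}{\partial A} \approx \frac{\Delta}{2} \sum_i \left( t_i\, e^{A t_i} f(t_i) + t_{i+1}\, e^{A t_{i+1}} f(t_{i+1}) \right)$$

The derivative simply passes through the finite sum (and the cumulative version differentiates the same way), which is exactly what automatic differentiation computes.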

Then the last output, log_lik, is just some simple mathematical operations with one new input variable, observed_data, which I call $y$.

Then the function $z$ that computes log_lik is:

$$z(A) = -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_i \left(y_i - h_i(A)\right)^2$$

(where $\sigma^2$ is `variance` and the $h_i(A)$ are the cumulative trapezoidal integrals). $z$ is no different from any other chain of maths operations in TensorFlow. Therefore, dz_dA is meaningful in the sense that the gradient of z with respect to A gives you exactly the gradient you can use to update A and minimize z.
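
As a sanity check (my addition, not part of the original answer), you can verify that the autodiff gradient of the trapezoidal sum matches the derivative taken by hand, term by term:

import tensorflow as tf

A = tf.constant(0.3)
t = tf.linspace(0., 10., 100)
f = tf.ones(100)
delta = t[1] - t[0]

with tf.GradientTape() as g:
  g.watch(A)
  y = tf.exp(A * t) * f
  h = 0.5 * delta * tf.reduce_sum(y[:-1] + y[1:])  # trapezoidal rule

auto_grad = g.gradient(h, A)

# by hand: d/dA e^(A*t) = t * e^(A*t), applied to each term of the sum
dy = t * tf.exp(A * t) * f
manual_grad = 0.5 * delta * tf.reduce_sum(dy[:-1] + dy[1:])

print(auto_grad.numpy(), manual_grad.numpy())  # the two numbers agree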

palazzo train
  • I understand the concept of GradientTape; I'm asking specifically about whether it makes sense to take the partial derivative with respect to a variable used within the integral approximation? – Cobbles May 28 '20 at 10:30
  • I see now. I have updated my answer. See if it can answer your question – palazzo train May 29 '20 at 02:22
  • Thanks. What helped me get over the final hurdle is that the partial derivative operation commutes with the integral/summation approximation. Therefore the integral operator is indeed no mystery when finding the partial derivative w.r.t. A. – Cobbles Jun 02 '20 at 10:55