
My inducing points are set to trainable but do not change when I call opt.minimize(). Why is that, and what does it mean? Does it mean the model is not learning? What is the difference between tf.optimizers.Adam(lr) and gpflow.optimizers.Scipy?

The following is a simple classification example adapted from the documentation. When I run this code with gpflow's Scipy optimizer, I get the trained results and the values of the inducing variables keep changing. But when I use the Adam optimizer, I get only a straight-line prediction, and the values of the inducing points remain the same. This suggests that the model is not learning with the Adam optimizer.

plot of data before training

plot of data after training with Adam

plot of data after training with gpflow's Scipy optimizer

The link for the example is https://gpflow.readthedocs.io/en/develop/notebooks/advanced/multiclass_classification.html

import numpy as np
import tensorflow as tf


import warnings
warnings.filterwarnings('ignore')  # ignore DeprecationWarnings from tensorflow

import matplotlib.pyplot as plt

import gpflow

from gpflow.utilities import print_summary, set_trainable
from gpflow.ci_utils import ci_niter

from tensorflow2_work.multiclass_classification import plot_posterior_predictions, colors

np.random.seed(0)  # reproducibility

# Number of functions and number of data points
C = 3
N = 100

# RBF kernel lengthscale
lengthscale = 0.1

# Jitter
jitter_eye = np.eye(N) * 1e-6

# Input
X = np.random.rand(N, 1)

kernel_se = gpflow.kernels.SquaredExponential(lengthscale=lengthscale)
K = kernel_se(X) + jitter_eye

# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(C)).T

# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
print(Y.shape)

# One-hot encoding
Y_hot = np.zeros((N, C), dtype=bool)
Y_hot[np.arange(N), Y] = 1

data = (X, Y)

plt.figure(figsize=(12, 6))
order = np.argsort(X.reshape(-1,))
print(order.shape)

for c in range(C):
    plt.plot(X[order], f[order, c], '.', color=colors[c], label=str(c))
    plt.plot(X[order], Y_hot[order, c], '-', color=colors[c])


plt.legend()
plt.xlabel('$X$')
plt.ylabel('Latent (dots) and one-hot labels (lines)')
plt.title('Sample from the joint $p(Y, \mathbf{f})$')
plt.grid()
plt.show()


# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)

# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(C)  # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink)  # Multiclass likelihood
Z = X[::5].copy()  # inducing inputs
#print(Z)

m = gpflow.models.SVGP(kernel=kernel, likelihood=likelihood,
    inducing_variable=Z, num_latent_gps=C, whiten=True, q_diag=True)

# Only train the variational parameters
set_trainable(m.kernel.kernels[1].variance, True)
set_trainable(m.inducing_variable, True)
print(m.inducing_variable.Z)
print_summary(m)


training_loss = m.training_loss_closure(data) 

opt.minimize(training_loss, m.trainable_variables)
print(m.inducing_variable.Z)
print_summary(m.inducing_variable.Z)


print(m.inducing_variable.Z)

# %%
plot_posterior_predictions(m, X, Y)
  • Would you be able to provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) for what specifically you're trying to do? – STJ May 16 '20 at 17:47
  • @STJ I have added the example. Could you please run it with the gpflow optimizer and then with the Adam optimizer, and check the values of the inducing points in both cases? – irum May 16 '20 at 19:08
  • Can you not run it yourself and add your results to the question? – joel May 16 '20 at 19:17
  • Your example isn't actually reproducible: it relies on a "tensorflow2_work" module that isn't part of gpflow (or tensorflow). You make it much easier for other people to help you when you provide an example that can simply be copy&pasted to reproduce your issues, and that has been cut down to the minimal needed to demonstrate your question/problem, so that others can go straight to the point rather than having to find their way through your code first. Hope that makes sense! – STJ May 16 '20 at 21:26
  • Actually, this is the example given in the gpflow documentation, and the module imported from tensorflow2_work is just for plotting the results; it can be commented out. – irum May 16 '20 at 22:25
  • @joelb I have added pictures of the results. Could you please check them and explain the difference? – irum May 16 '20 at 22:44
  • @irum your example also uses `opt` but doesn't define it anywhere. I've got an idea what is happening, but you'll get better answers more quickly if you make it easier for other people to actually look at what you're trying to do ... – STJ May 18 '20 at 06:53

1 Answer


The example given in the question isn't copy&pastable, but it seems like you simply exchange opt = gpflow.optimizers.Scipy() with opt = tf.optimizers.Adam(). The minimize() method of gpflow's Scipy optimizer runs one call of scipy.optimize.minimize, which by default runs to convergence (you can also specify a maximum number of iterations by passing, e.g., options=dict(maxiter=100) to the minimize() call).
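
For reference, that call could look roughly like this (a sketch that reuses the m, data and training_loss names from your code; the maxiter value is arbitrary):

opt = gpflow.optimizers.Scipy()
training_loss = m.training_loss_closure(data)  # zero-argument closure over (X, Y)
opt.minimize(
    training_loss,
    m.trainable_variables,
    options=dict(maxiter=100),  # cap the number of Scipy iterations
)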

In contrast, the minimize() method of TensorFlow optimizers runs only a single optimization step. To run more steps, say iterations = 100, you need to write a loop manually:

for _ in range(iterations):
    opt.minimize(model.training_loss, model.trainable_variables)

For this to actually run fast, you also need to wrap the optimization step in tf.function:

@tf.function
def optimization_step():
    opt.minimize(model.training_loss, model.trainable_variables)

for _ in range(iterations):
    optimization_step()

This runs exactly iterations steps. In TensorFlow you have to handle convergence checks yourself; your model may or may not have converged after this many steps.
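
If you want a rough stopping rule, one option is to monitor the objective yourself, e.g. (a sketch; it assumes the optimization_step above plus a zero-argument loss closure such as training_loss = model.training_loss_closure(data), and the tolerance and step cap are arbitrary):

previous_loss = float("inf")
for _ in range(1000):  # upper bound on the number of steps
    optimization_step()
    current_loss = training_loss().numpy()
    if abs(previous_loss - current_loss) < 1e-6:  # crude convergence check on the objective
        break
    previous_loss = current_loss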

So in your usage, you only ran one step. This did change the parameters, but presumably by too little to notice a difference. (You could see a larger effect from a single step by making the learning rate much higher, though that would not be a good idea for actually optimizing the model over many steps.)

Usage of the Adam optimizer with GPflow models is demonstrated in the notebook on stochastic variational inference, though it also works for non-stochastic optimization.
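
Put together for the SVGP model in your question, the Adam route could look roughly like this (a sketch; the learning rate of 0.001 and the 1000 steps are placeholders to tune):

opt = tf.optimizers.Adam(learning_rate=0.001)
training_loss = m.training_loss_closure(data)  # zero-argument closure over (X, Y)

@tf.function
def optimization_step():
    opt.minimize(training_loss, m.trainable_variables)

for _ in range(1000):
    optimization_step()

print_summary(m)  # m.inducing_variable.Z should now have moved away from its initial values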

Note that, in any case, all parameters such as inducing point locations are set trainable by default, so your call to set_trainable(..., True) doesn't affect what's going on here.
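
If you actually want the behaviour suggested by your # Only train the variational parameters comment, you would pass False instead, e.g. (a sketch):

set_trainable(m.inducing_variable, False)           # keep the inducing inputs Z fixed
set_trainable(m.kernel.kernels[1].variance, False)  # keep the White-kernel variance fixed
print_summary(m)  # the 'trainable' column now shows False for these parameters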

  • Thank you for your detailed answer. I had already implemented the optimization in the way explained in your answer. After posting the question on GitHub, I started reading the whole documentation and came across the above-mentioned Stochastic Variational Inference example, so I resolved the issue. I really appreciate your help and quick response. – irum May 18 '20 at 18:03