The GPflow docs provide an example for multi-class classification with the robust-max function. I am trying to train a multi-class classifier with the softmax likelihood instead, which is also implemented in GPflow, but I can't find any documentation or examples of how to use it correctly.
Please find an example of what I have tried below. During training, the loss decreases smoothly.
The robust-max example mentioned above uses categorical labels, i.e., the values 0, 1, 2, but simply replacing the robust-max with the softmax likelihood raises an IndexError in the quadrature method. I therefore assumed that the model with the softmax likelihood requires one-hot encoded labels. At test time, however, I noticed that for this three-class toy example the model never predicts the third class. On closer inspection, the Softmax likelihood has the following method:
def _log_prob(self, F, Y):
    return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y[:, 0])
which looks like it expects an array of categorical labels of shape [num_samples, 1].
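To double-check that shape expectation, the underlying TF op can be called directly; a minimal standalone sketch (not GPflow-specific, shapes chosen for illustration):

import tensorflow as tf

N_demo, C_demo = 5, 3
F_demo = tf.random.normal((N_demo, C_demo))  # latent values, one column per class
# Categorical labels of shape [num_samples, 1], values in {0, 1, 2}
Y_demo = tf.random.uniform((N_demo, 1), minval=0, maxval=C_demo, dtype=tf.int64)

# Mirrors what Softmax._log_prob does: per-sample log-probabilities, shape [N_demo]
log_p = -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F_demo, labels=Y_demo[:, 0])
print(log_p.shape)  # (5,)

This runs without error; with one-hot labels, Y[:, 0] would instead pick out the 0/1 indicator column for the first class.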
What is the right way to use the softmax likelihood for GP multi-class classification?
import numpy as np
import tensorflow as tf
import gpflow
from gpflow.likelihoods.multiclass import Softmax
from tqdm.auto import tqdm
np.random.seed(0)
tf.random.set_seed(123)
# Number of functions and number of data points
num_classes = 3
N = 100
# Create training data
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
# SquaredExponential kernel matrix
kernel_se = gpflow.kernels.SquaredExponential(lengthscales=0.1)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=num_classes).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
# One-hot encoding
Y_hot = np.zeros((N, num_classes), dtype=int)
Y_hot[np.arange(N), Y] = 1
data = (X, Y_hot)
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
likelihood = Softmax(num_classes)
m = gpflow.models.VGP(
    data=data,
    kernel=kernel,
    likelihood=likelihood,
    num_latent_gps=num_classes,
)
def run_adam(model, iterations):
    """
    Utility function running the Adam optimizer.
    """
    losses = []
    training_loss = model.training_loss
    # Create an Adam optimizer
    optimizer = tf.optimizers.Adam()

    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    for step in tqdm(range(iterations), total=iterations):
        optimization_step()
        if step % 10 == 0:
            # Record the ELBO (negative training loss) every 10 steps
            elbo = -training_loss().numpy()
            losses.append(elbo)
    return losses
run_adam(model=m, iterations=10000)
y_pred = m.predict_y(X)[0]
print("Training accuracy: {:3.2f}".format(np.mean(Y == np.argmax(y_pred, axis=1))))