The GPflow docs provide an example for multi-class classification with the robust-max function. I am trying to train a multi-class classifier with the softmax likelihood instead, which is also implemented in GPflow, but I can't find any documentation or examples of how to use it correctly.
Please find an example of what I have tried below. During training, the loss decreases smoothly.
The robust-max example mentioned above uses categorical labels, i.e., the values 0, 1, 2, but simply replacing the robust-max with the softmax likelihood raises an IndexError in the quadrature method. I therefore assumed that the model with the softmax likelihood requires one-hot encoded labels. At test time, however, I noticed that for this three-class toy example the model never predicts the third class. On closer inspection, the Softmax likelihood has the following method:
def _log_prob(self, F, Y):
    return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y[:, 0])
which looks like it expects an array of categorical labels of shape [num_samples, 1].
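To double-check that shape expectation, the underlying TF op can be called directly; a minimal standalone sketch (not GPflow-specific, shapes chosen for illustration):

import tensorflow as tf

N_demo, C_demo = 5, 3
F_demo = tf.random.normal((N_demo, C_demo))  # latent values, one column per class
# Categorical labels of shape [num_samples, 1], values in {0, 1, 2}
Y_demo = tf.random.uniform((N_demo, 1), minval=0, maxval=C_demo, dtype=tf.int64)

# Mirrors what Softmax._log_prob does: per-sample log-probabilities, shape [N_demo]
log_p = -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F_demo, labels=Y_demo[:, 0])
print(log_p.shape)  # (5,)

This runs without error; with one-hot labels, Y[:, 0] would instead pick out the 0/1 indicator column for the first class.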
What is the right way to use the softmax likelihood for GP multi-class classification?
import numpy as np
import tensorflow as tf
import gpflow
from gpflow.likelihoods.multiclass import Softmax
from tqdm.auto import tqdm
np.random.seed(0)
tf.random.set_seed(123)
# Number of functions and number of data points
num_classes = 3
N = 100
# Create training data
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
# SquaredExponential kernel matrix
kernel_se = gpflow.kernels.SquaredExponential(lengthscales=0.1)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=num_classes).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
# One-hot encoding
Y_hot = np.zeros((N, num_classes), dtype=int)
Y_hot[np.arange(N), Y] = 1
data = (X, Y_hot)
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
likelihood = Softmax(num_classes)
m = gpflow.models.VGP(
    data=data,
    kernel=kernel,
    likelihood=likelihood,
    num_latent_gps=num_classes,
)
def run_adam(model, iterations):
    """
    Utility function running the Adam optimizer.
    """
    losses = []
    training_loss = model.training_loss
    # Create an Adam optimizer
    optimizer = tf.optimizers.Adam()

    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    for step in tqdm(range(iterations), total=iterations):
        optimization_step()
        if step % 10 == 0:
            # Record the ELBO (negative training loss) every 10 steps
            elbo = -training_loss().numpy()
            losses.append(elbo)
    return losses
run_adam(model=m, iterations=10000)
y_pred = m.predict_y(X)[0]
print("Training accuracy: {:3.2f}".format(np.mean(Y == np.argmax(y_pred, axis=1))))