SVGP for US Flight data

Question

My problem is an optimization issue with SVGP on the US Flight dataset. I implemented the SVGP model for the US flight data described in Hensman 2014, using 100 inducing points, batch_size = 1000, learning rate = 1e-5 and maxiter = 500.

The result is pretty strange: the ELBO does not increase, and it has large variance no matter how I tune the learning rate.

Initialization

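import gpflow

M = 100  # number of inducing points
D = 8    # input dimension

def init():
    # ARD RBF kernel: signal variance 1, one lengthscale per input dimension
    kern = gpflow.kernels.RBF(D, 1, ARD=True)
    # Initialize the inducing inputs at the first M training points
    Z = X_train[:M, :].copy()
    m = gpflow.models.SVGP(X_train, Y_train.reshape([-1,1]), kern, gpflow.likelihoods.Gaussian(), Z, minibatch_size=1000)
    return m

m = init()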

Inference

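import matplotlib.pyplot as plt

m.feature.trainable = True  # optimize the inducing inputs as well
opt = gpflow.train.AdamOptimizer(learning_rate = 0.00001)
m.compile()
opt.minimize(m, step_callback=logger, maxiter = 500)
plt.plot(logf)  # ELBO values recorded by the logger callback
plt.xlabel('iteration')
plt.ylabel('ELBO')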

Result:

Added Results

Once I ran more iterations with a larger learning rate, the ELBO increased with the iterations, which is good to see. But it is very confusing that the RMSE (root mean square error) on both the training and testing data increases too. Do you have any suggestions? The figures and code are shown below:

ELBOs vs iterations

Train RMSEs vs iterations

Test RMSEs vs iterations

Using logger

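import time
import numpy as np

# Histories filled in by the callback (assumed to be defined like this in the full code)
logx, logf, logt = [], [], []
rmse_hist, rmse_test_hist = [], []
st = time.time()

def logger(x):
    # Minibatch estimate of the ELBO; evaluate once so the printed and stored values agree
    elbo = m.compute_log_likelihood()
    print(elbo)
    logx.append(x)
    logf.append(elbo)
    logt.append(time.time() - st)
    py_train = m.predict_y(X_train)[0]
    py_test = m.predict_y(X_test)[0]
    # predict_y returns shape (N, 1); reshape the targets so the difference is not broadcast to (N, N)
    rmse_hist.append(np.sqrt(np.mean((Y_train.reshape(-1, 1) - py_train)**2)))
    rmse_test_hist.append(np.sqrt(np.mean((Y_test.reshape(-1, 1) - py_test)**2)))
    logger.i+=1
logger.i = 1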
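
One thing that may be worth checking in the RMSE computation is array broadcasting: predict_y returns the mean with shape (N, 1), so subtracting it from a 1-D target array of shape (N,) gives an (N, N) matrix rather than N residuals. Whether Y_train and Y_test are 1-D here is an assumption, based on the reshape in the model constructor. A minimal sketch with toy arrays; the helper name rmse is hypothetical:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error after flattening both arrays to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must contain the same number of values")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy arrays standing in for Y_train and m.predict_y(X_train)[0]
y = np.array([1.0, 2.0, 3.0])            # shape (3,)
pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), a perfect prediction

broadcast_diff = y - pred                # broadcasts to shape (3, 3)
naive = float(np.sqrt(np.mean(broadcast_diff ** 2)))  # nonzero despite perfect prediction
safe = rmse(y, pred)                     # compares element by element
```

If the shapes do differ in the real code, the naive value mixes every target with every prediction, so it would not track the true RMSE.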

The full code is available through the link.

500 iterations is far too few. I don't know how many it needs to complete off the top of my head, but try running it for 50000 iterations with a learning rate of 1e-3. — Mark van der Wilk, Aug 15 '19 at 20:39
I changed the tuning parameters, but the changes in the RMSEs on both the training and testing data are very strange. Could you please take a look? — Rui Meng, Aug 17 '19 at 17:02
Have you looked at what your hyperparameters go to? I suspect the prior variance of the GP may be going to zero. This is a local optimum. The usual way to deal with this is to fix the kernel hyperparameters for some initial iterations. Hensman 2013 does this. A good way to identify when this happens is to compare the ELBO to that of a noise model. — Mark van der Wilk, Aug 18 '19 at 18:06
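
The noise-model comparison mentioned above can be sketched as follows. This is one reading of "noise model", not necessarily the one intended: the targets are treated as i.i.d. Gaussian with no input dependence, and the mean and variance are set to their maximum-likelihood values. The function name and the fit_mean flag are hypothetical; with a zero mean function and uncentered targets, fit_mean=False may be the closer match.

```python
import numpy as np

def noise_model_log_likelihood(y, fit_mean=True):
    """Log likelihood of y under i.i.d. Gaussian noise at the maximum-likelihood fit."""
    y = np.asarray(y, dtype=float).ravel()
    n = y.size
    mu = y.mean() if fit_mean else 0.0
    var = np.mean((y - mu) ** 2)  # ML variance estimate (divides by n)
    # At the ML fit, the quadratic term sums to n / 2
    return float(-0.5 * n * (np.log(2.0 * np.pi * var) + 1.0))

# Synthetic targets for illustration only
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=3.0, scale=2.0, size=1000)
baseline = noise_model_log_likelihood(y_demo)
```

An ELBO that stays close to this baseline, when both are computed on the same scale (the full dataset), would be consistent with the GP having collapsed to pure noise.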
I found that my issue was that the optimizer had difficulty learning the ARD length-scale parameters. After I normalized the raw data for each input dimension, the algorithm works. Thanks for your suggestions! — Rui Meng, Aug 20 '19 at 02:50
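
The per-dimension normalization could look like the sketch below. The comment does not say which normalization was used; z-scoring with training-set statistics is an assumption, and the function name standardize and the toy arrays are hypothetical.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-12):
    """Scale each input column to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std < eps, 1.0, std)  # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std, mean, std

# Toy inputs whose columns live on very different scales
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 1000.0])
X_tr = rng.normal(size=(200, 3)) * scales
X_te = rng.normal(size=(50, 3)) * scales

X_tr_n, X_te_n, mu, sd = standardize(X_tr, X_te)
```

With all inputs on a comparable scale, a shared initial length-scale is a reasonable starting point for every ARD dimension. The test inputs reuse the training mean and standard deviation so that both sets are mapped the same way.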
