
I am trying to understand GPR, and I am testing it to predict some values. The response is the first component of a PCA, so the data are relatively clean, without outliers. The predictors also come from a PCA (n=2), and both predictor columns have been standardized with StandardScaler().fit_transform, as previous posts suggested this works better. Since the predictors are standardized, I am using an RBF kernel multiplied by a 1**2 constant, and letting the hyperparameters be fitted. The problem is that the model fits the training data perfectly but gives almost constant values for the test data. The set has 463 points, and no matter whether I randomly pick 20, 100, or 200 of them for the training data, or add a WhiteKernel() or alpha values, I get the same result. I am almost certain I am doing something wrong, but I can't find what. Any help? Here's the relevant chunk of code and the output:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as cKrnl, RBF

# Sum of two constant * RBF kernels, with extremely wide hyperparameter bounds
k1 = cKrnl(1**2, (1e-40, 1e40)) * RBF(2, (1e-40, 1e40))
k2 = cKrnl(1**2, (1e-40, 1e40)) * RBF(2, (1e-40, 1e40))
kernel = k1 + k2

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, normalize_y=True)
gp.fit(x_train, y_train)
print("GPML kernel: %s" % gp.kernel_)

Output: GPML kernel: 1**2 * RBF(length_scale=0.000388) + 8.01e-18**2 * RBF(length_scale=2.85e-18)
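For reference, with bounds as wide as (1e-40, 1e40) the optimizer is free to shrink the length-scale toward zero, which interpolates the training points exactly and reverts to the prior everywhere else. A more constrained setup might look like the sketch below (the bounds here are illustrative assumptions, not from the original post; x_train and y_train are the same arrays as above):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Illustrative bounds (assumed): keep the length-scale away from zero
# and model observation noise explicitly with a WhiteKernel term.
kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) \
         + WhiteKernel(noise_level=1.0, noise_level_bounds=(1e-5, 1e1))

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, normalize_y=True)
gp.fit(x_train, y_train)
print("Fitted kernel:", gp.kernel_)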

[Figure: Training data]

[Figure: Test data and prediction]

Thanks to all!!!

– ger.code
  • The learned length-scales in both kernels are very small, so the fit is not good. Have you tried how other ML models perform on your data? – Zeel B Patel May 13 '21 at 18:04
  • Hi! Thanks for the comment. I tried random forest regression and it works a little better, in the sense that at least the predictions on the test set are not constant. But it puzzles me why in GPR the training set is fitted perfectly while the test set is terrible, in the sense that the model seems to not be doing anything :S – ger.code May 13 '21 at 19:39
  • In GPR, the length-scale parameter decides the effect of the train data on new input locations (x) close to the train points. So, if the length-scale is too small, predictions revert to the prior (zero) within a short distance of the train locations. That is why your predictions always look constant. The best way to visualize this effect is to create test locations at a fine-grained level and check the predictions. For example, if your train locations are 1, 2, 3, 4, 5, ..., create test locations 0.1, 0.2, 0.3, ... and check the output. – Zeel B Patel May 14 '21 at 09:13
  • @ZeelBPatel thank you very much for the clarification!!! I'll investigate the matter with more suitable data and try to find out what is making the length-scales go so low. Actually, random forest fits reasonably well, but the error margins are too wide, so I'm guessing the input data is somewhat broken, since it also has some issues with the training data. – ger.code May 14 '21 at 13:51
  • Yes, if your motive is just exploration with GPR, I would suggest generating your own data with known parameters, such as ```np.sin(x) + np.random.normal(mean, scale, size=x.shape[0])```; there you will see the length-scale effect much more clearly (see the runnable sketch below). – Zeel B Patel May 14 '21 at 16:44
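A minimal runnable sketch of that last suggestion (assuming ```np.random``` in the comment was meant as ```np.random.normal```): it fits GPR to noisy sin(x) data with a tiny fixed length-scale versus a reasonable one, so you can watch the predictions snap back to the prior mean between training points, exactly as described above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic 1-D data with known structure: y = sin(x) + noise
rng = np.random.default_rng(0)
x_train = np.linspace(0, 10, 20).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + rng.normal(0, 0.1, x_train.shape[0])
x_test = np.linspace(0, 10, 500).reshape(-1, 1)

# Compare a tiny (overfitting) length-scale with a reasonable one;
# "fixed" bounds skip hyperparameter optimization so the effect is isolated.
for ls in (0.01, 1.0):
    gp = GaussianProcessRegressor(kernel=RBF(ls, "fixed"), alpha=0.1**2)
    gp.fit(x_train, y_train)
    plt.plot(x_test, gp.predict(x_test), label="length_scale=%s" % ls)

plt.scatter(x_train, y_train, c="k", label="train")
plt.legend()
plt.show()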

0 Answers