Different results when optimizing hyperparameter for a Gaussian process regression

Question

i'm studying gaussian process regression, and i'm trying to use the built-in functions from scikit-learn, and also trying to impement a custom function for doing so.

This is the code when using scikit-learn:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) # Function

v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel,n_restarts_optimizer=50).fit(X,Y) #fitting the process to get optimized 
hyperparameter
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn 
Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05) #result

And this is the code i wrote manually:

def marglike(par,X,Y): #defining log-marginal-likelihood
# print(par)
l,var,sigma_n = par
n = len(X)
dist_X = (X - X.T)**2
# print(dist_X)
k = var*np.exp(-(1/(2*(l**2)))*dist_X)
inverse = np.linalg.inv(k + (sigma_n**2)*np.eye(len(k))) 
ml = (1/2)*np.dot(np.dot(Y.T,inverse),Y) + (1/2)*np.log(np.linalg.det(k + 
(sigma_n**2)*np.eye(len(k)))) + (n/2)*np.log(2*np.pi)
return ml

b= [0.0005,100]
bnd = [b,b,b] #bounds used for "minimize" function
start = np.array([1.1,1.6,0.05]) #initial hyperparameters values
re = minimize(marglike,start,args=(X,Y),method="L-BFGS-B",options = {'disp':True},bounds=bnd) #the 
method used is the same as the one used by scikit-learn
re.x #Hyperparameter results
Out[]: array([3.55266484e+00, 9.99986210e+01, 5.00000000e-04])

As you can see, the hyperparameter i got from the 2 methods are different, but yet i used the same data(X,Y) and same minimization method.

Could somebody help me to understand why and maybe how to get same results ?!

those might be basically "the same", it's difficult to say more with the information provided. could just be numerical stability issues in your code, your could try [GPy](https://github.com/SheffieldML/GPy) as another implementation to check your code against — Sam Mason, Feb 19 '21 at 12:26
Thanks, i will try. I have just noticed that if i set same bounds and same initial parameters (for both metods) i get almost the same results. I will give it a try with GPy — Pasquale Russo, Feb 19 '21 at 12:46
actually tried running your code, and it mostly works. I'd suggest trying with more data points (10 doesn't really give it enough to fit three parameters, 50 might be a better place to start) and add in some noise (gaussian would probably a good fit) — Sam Mason, Feb 19 '21 at 12:46

score 1 · Accepted Answer · answered Feb 20 '21 at 18:17

As suggested by San Mason, adding noise actually works! Otherwise, while you do it manually (in the custom code), set the initial noise to reasonably low and have multiple restarts with different initializations then you will get values close by. By the way, noiseless data seems to be creating a stationary ridge in the space of hyperparameters (like Fig. 1.6 in Surrogates GP book). Note that scikit-learn noise is sigma_n^2 for your custom function. Below are the snippets of noisy and noise-less cases.

Noise-less case

scikit-learn

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) #+ np.random.normal(10)# Function

v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel,n_restarts_optimizer=50).fit(X,Y) #fitting the process to get optimized 
# hyperparameter
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn 
# Out[]: 14.1**2 * RBF(length_scale=3.7) + WhiteKernel(noise_level=1e-05) #result

custom function

def marglike(par,X,Y): #defining log-marginal-likelihood
    # print(par)
    l,std,sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k) 
    ml = (1/2)*np.dot(np.dot(Y.T,inverse),Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0,0]

b= [10**-5,10**5]
bnd = [b,b,b] #bounds used for "minimize" function
start = [1,1,10**-5] #initial hyperparameters values
re = minimize(fun=marglike,x0=start,args=(X,Y),method="L-BFGS-B",options = {'disp':True},bounds=bnd) #the 
# method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (9.920690495739379, 3.5657912350017575, 1.0000000000000002e-10)

Noisy case

scikit-learn

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as gpr
from sklearn.gaussian_process.kernels import RBF,WhiteKernel,ConstantKernel as C
from scipy.optimize import minimize
import scipy.stats as s

X = np.linspace(0,10,10).reshape(-1,1) # Input Values
Y = 2*X + np.sin(X) + np.random.normal(size=10).reshape(10,1)*0.1 # Function

v = 1
kernel = v*RBF() + WhiteKernel() #Defining kernel
gp = gpr(kernel=kernel,n_restarts_optimizer=50).fit(X,Y) #fitting the process to get optimized 
# hyperparameter
gp.kernel_ #Hyperparameters optimized by the GPR function in scikit-learn 
# Out[]: 10.3**2 * RBF(length_scale=3.45) + WhiteKernel(noise_level=0.00792) #result

Custom function

def marglike(par,X,Y): #defining log-marginal-likelihood
    # print(par)
    l,std,sigma_n = par
    n = len(X)
    dist_X = (X - X.T)**2
    # print(dist_X)
    k = std**2*np.exp(-(dist_X/(2*(l**2)))) + (sigma_n**2)*np.eye(n)
    inverse = np.linalg.inv(k) 
    ml = (1/2)*np.dot(np.dot(Y.T,inverse),Y) + (1/2)*np.log(np.linalg.det(k)) + (n/2)*np.log(2*np.pi)
    return ml[0,0]

b= [10**-5,10**5]
bnd = [b,b,b] #bounds used for "minimize" function
start = [1,1,10**-5] #initial hyperparameters values
re = minimize(fun=marglike,x0=start,args=(X,Y),method="L-BFGS-B",options = {'disp':True},bounds=bnd) #the 
# method used is the same as the one used by scikit-learn
re.x[1], re.x[0], re.x[2]**2
# Output - (10.268943740577331, 3.4462604625225106, 0.007922681239535326)

Thank you both for you precious help, i really appreciate it! — Pasquale Russo, Feb 21 '21 at 16:18

Different results when optimizing hyperparameter for a Gaussian process regression

1 Answers1

Noise-less case

scikit-learn

custom function

Noisy case

scikit-learn

Custom function