Matlab
I have some PLSR regression in Matlab that I need to translate to Python. The Matlab code is as follows:
% PLSR with 15 latent variables
[~,~,~,~,~,MSEcv] = plsregress(X,y,15,"cv",5,"MCReps",100);
rRMSE = 100*sqrt(MSEcv(2,:))/max(y)-min(y));
% Find optimal number of latent variables
minl = 1; maxl = 15
nlv = find(rMSE(minl+1:maxl+1)==min(rRMSE(minl+1:maxl+1)))-1+minl;
nlv = nlv(1)
% Fit with optimal number of latent variables
[XL,yl,XS,YS,beta,PCTVAR;MSE;stats] = plsregress(X,y,nlv)
(PLS regression with a maximum of 15 latent variables, cross-validation with 1/5 of the values and 100 Monte-Carlo repetitions.)
Python
I tried to implement the code in Python as follows:
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
# Maximum number of latent variables
mlv = 15
# Cross-validation
cv = 5
# Monte-Carlo repetitions
mcr = 100
# 1...mlv to fit models with various number of latent variables
try_latent_vars = np.arange(1, mlv)
##----------------------------------------------------------------------------|
# Define funtion to fit PLSmodel
def optimise_pls_cv(X_vals, y_vals, n_comp, crossval, mcreps):
'''Fit PLS regression model using cross-validation.'''
# Define PLS object
pls = PLSRegression(n_components = n_comp)
# Cross-validation fit
cv_split = ShuffleSplit(n_splits = mcreps, test_size = 1/crossval,
random_state = 0)
cvs = cross_validate(pls, X_vals, y_vals, cv = cv_split,
scoring = ["r2", "neg_mean_squared_error"])
mean_r2_error = np.mean(cvs["test_r2"])
test_mse = -np.mean(cvs["test_neg_mean_squared_error"])
return pls, mean_r2_error, test_mse
##----------------------------------------------------------------------------|
# Fit PLS model
# Empty lists to store R^2 and mean squared error values
r2s = []
mses = []
for n_comp in try_latent_vars:
model, r2, mse = optimise_pls_cv(X_vals = X.T,
y_vals = y,
n_comp = n_comp,
crossval = cv,
mcreps = mcr)
r2s.append(r2)
mses.append(mse)
index_max_r2s = np.argmax(r2s)
lv = try_latent_vars[index_max_r2s]
##----------------------------------------------------------------------------|
## Fit model with optimized number of components
model, r2, mse = optimise_pls_cv(X_vals = X.T,
y_vals = y,
n_comp = lv,
crossval = cv,
mcreps = mcr)
metrics = {"R2" : r2,
"MSE" : mse}
metrics_str = "R2: %0.4f, MSE: %0.4f" % (r2, mse)
In both code snippets, X
and y
are a set of spectral information and some variable to predict from the spectra, respectively.
The problem
Unfortunately, I get vastly different results with those versions (R^2 values for the Matlab version are all 0.5 or higher while the Python version ends up with R^2 values close to zero or negative). How is that and how can I solve this issue?
Example data
Here is some example data as .txt files in Dropbox. Read as
import pandas as pd
X = pd.read_table("/path/to/saved/files/X.txt", header = None)
y = pd.read_table("/path/to/saved/files/y.txt", header = None)