
Matlab

I have some PLSR regression in Matlab that I need to translate to Python. The Matlab code is as follows:

% PLSR with 15 latent variables
[~,~,~,~,~,~,MSEcv] = plsregress(X,y,15,"cv",5,"MCReps",100);
rRMSE = 100*sqrt(MSEcv(2,:))/(max(y)-min(y));
% Find optimal number of latent variables
minl = 1; maxl = 15;
nlv = find(rRMSE(minl+1:maxl+1)==min(rRMSE(minl+1:maxl+1)))-1+minl;
nlv = nlv(1)
% Fit with optimal number of latent variables
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,nlv);

(PLS regression with a maximum of 15 latent variables, 5-fold cross-validation, i.e. 1/5 of the observations held out per fold, and 100 Monte-Carlo repetitions of the partitioning.)
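One thing to note when translating: Matlab's "cv",5 requests an actual 5-fold partitioning, and "MCReps",100 repeats that partitioning 100 times. In sklearn, the closer analogue is presumably RepeatedKFold rather than independent random shuffle splits; here is a minimal sketch (the parameter mapping is my assumption, not verified against the plsregress internals):

from sklearn.model_selection import RepeatedKFold

# Presumed analogue of "cv",5,"MCReps",100: a 5-fold partition,
# re-drawn 100 times with fresh shuffles (assumption, see above)
cv_split = RepeatedKFold(n_splits = 5, n_repeats = 100, random_state = 0)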


Python

I tried to implement the code in Python as follows:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

# Maximum number of latent variables
mlv = 15
# Cross-validation
cv = 5
# Monte-Carlo repetitions
mcr = 100

# 1...mlv to fit models with varying numbers of latent variables
try_latent_vars = np.arange(1, mlv + 1)

##----------------------------------------------------------------------------|
# Define function to fit PLS model
def optimise_pls_cv(X_vals, y_vals, n_comp, crossval, mcreps):
    '''Fit PLS regression model using cross-validation.'''
    # Define PLS object
    pls = PLSRegression(n_components = n_comp)

    # Cross-validation fit
    cv_split = ShuffleSplit(n_splits = mcreps, test_size = 1/crossval,
                            random_state = 0)
    cvs = cross_validate(pls, X_vals, y_vals, cv = cv_split,
                         scoring = ["r2", "neg_mean_squared_error"])
    mean_r2_error = np.mean(cvs["test_r2"])
    test_mse = -np.mean(cvs["test_neg_mean_squared_error"])
    return pls, mean_r2_error, test_mse

##----------------------------------------------------------------------------|
# Fit PLS model

# Empty lists to store R^2 and mean squared error values
r2s = []
mses = []

for n_comp in try_latent_vars:
    model, r2, mse = optimise_pls_cv(X_vals = X.T,
                                     y_vals = y,
                                     n_comp = n_comp,
                                     crossval = cv,
                                     mcreps = mcr)
    r2s.append(r2)
    mses.append(mse)

index_max_r2s = np.argmax(r2s)
lv = try_latent_vars[index_max_r2s]

##----------------------------------------------------------------------------|
## Fit model with optimized number of components
model, r2, mse = optimise_pls_cv(X_vals = X.T,
                                 y_vals = y,
                                 n_comp = lv,
                                 crossval = cv,
                                 mcreps = mcr)
metrics = {"R2" : r2,
           "MSE" : mse}
metrics_str = "R2: %0.4f, MSE: %0.4f" % (r2, mse)

In both code snippets, X holds the spectral data and y the variable to be predicted from the spectra.


The problem

Unfortunately, I get vastly different results with the two versions (the R^2 values for the Matlab version are all 0.5 or higher, while the Python version ends up with R^2 values close to zero or negative). Why is that, and how can I solve this issue?


Example data

Here is some example data as .txt files in Dropbox. Read as

import pandas as pd
X = pd.read_table("/path/to/saved/files/X.txt", header = None)
y = pd.read_table("/path/to/saved/files/y.txt", header = None)

1 Answer


I found that sklearn uses the NIPALS algorithm, while Matlab's plsregress uses the SIMPLS algorithm to estimate the PLS regression parameters. For some of my data this does not make a huge difference, but I found the NIPALS version to produce output of highly variable quality under slight variations of the input spectra (e.g., outlier removal or continuum removal can have a huge impact for some variables).
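For anyone who wants to stay closer to Matlab, below is a minimal sketch of the SIMPLS algorithm following de Jong (1993). The function name simpls and its interface are my own; it only returns the regression coefficients and intercept, so it is not a drop-in replacement for plsregress and its extra outputs, and I have not benchmarked it against either implementation.

import numpy as np

def simpls(X, Y, n_components):
    '''Bare-bones SIMPLS (de Jong, 1993). X: (n, p), Y: (n, m) or (n,).'''
    X = np.asarray(X, dtype = float)
    Y = np.asarray(Y, dtype = float).reshape(len(X), -1)
    x_mean, y_mean = X.mean(axis = 0), Y.mean(axis = 0)
    Xc, Yc = X - x_mean, Y - y_mean
    p, m = Xc.shape[1], Yc.shape[1]
    R = np.zeros((p, n_components))  # X weights
    P = np.zeros((p, n_components))  # X loadings
    Q = np.zeros((m, n_components))  # Y loadings
    V = np.zeros((p, n_components))  # orthonormal basis of the X loadings
    S = Xc.T @ Yc                    # cross-product matrix, deflated below
    for a in range(n_components):
        # weight vector: dominant left singular vector of S
        r = np.linalg.svd(S, full_matrices = False)[0][:, 0]
        t = Xc @ r                   # scores for this component
        t -= t.mean()
        normt = np.linalg.norm(t)
        t /= normt
        r /= normt
        P[:, a] = Xc.T @ t
        Q[:, a] = Yc.T @ t
        v = P[:, a].copy()
        if a > 0:                    # orthogonalise against previous loadings
            v -= V[:, :a] @ (V[:, :a].T @ P[:, a])
        v /= np.linalg.norm(v)
        S -= np.outer(v, v @ S)      # deflate the cross-product matrix
        R[:, a], V[:, a] = r, v
    beta = R @ Q.T                   # coefficients for centred data
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Usage: beta, b0 = simpls(X, y, nlv); y_hat = X @ beta + b0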
