
I'm trying to perform feature selection on some spectroscopy data using scikit-learn's RFECV. I want to use a pipeline with PLSRegression as its last step as the estimator for RFECV. However, I'm getting different (clearly wrong) results when using PLSRegression in a pipeline versus on its own. Error details and minimal reproducible examples are below.

  0. Imports and setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline

# Use GitHub data from the NIRPY blog as sample data: NIR spectra used to predict peach brix values
df = pd.read_csv(r'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/peach_spectra_brix.csv')
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
  1. First, running RFECV with a plain PLSRegression object, to show what I'm expecting.
pls = PLSRegression(n_components=3)
selector = RFECV(pls, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=2, 
                 min_features_to_select=(2*np.shape(X)[0]), # Don't remove too many wavelengths
                 scoring='neg_root_mean_squared_error')
selector = selector.fit(X, y)

fig = plt.figure()
plt.plot(selector.ranking_)
plt.show()

Which yields this plot, with all the rank=1 wavelengths being selected by the algorithm in different regions of the spectrum, as expected:

[Image: PLSRegression ranking]

  2. However, the real data I'm analyzing will be sent through a pipeline with several steps, the last of which is a PLSRegression object. So I tried a typical scikit-learn pipeline, with PLS as the only step for simplicity:
pipe = Pipeline([('pls', PLSRegression(n_components=3))])

selector2 = RFECV(pipe, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=2, 
                 min_features_to_select=(2*np.shape(X)[0]),
                 scoring='neg_root_mean_squared_error')
selector2 = selector2.fit(X, y)

But this yields an error:

ValueError: when `importance_getter=='auto'`, the underlying estimator Pipeline should have `coef_` or
`feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before
calling transform.
  3. So I found a potential workaround that gives my Pipeline class a coef_ attribute, per the question "best-found PCA estimator to be used as the estimator in RFECV":
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

mypipe = Mypipeline([('pls', PLSRegression(n_components=3))])

selector3 = RFECV(mypipe, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=-2, 
                 min_features_to_select=(2*np.shape(X)[0]),
                 scoring='neg_root_mean_squared_error')
selector3 = selector3.fit(X, y)

On the bright side, we've avoided the error. On the not so bright side, the wavelengths have been ranked sequentially in descending order so the last 2*np.shape(X)[0] features are used, and the rest ignored. The ranking plot looks like this:

[Image: Ranking from using a custom pipeline]

Which is clearly wrong.

  4. My next attempt used the importance_getter parameter instead, as described in the docs for RFECV:
selector4 = RFECV(pipe, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=-2, 
                 min_features_to_select=(2*np.shape(X)[0]),
                 scoring='neg_root_mean_squared_error',
                 importance_getter=pipe.named_steps.pls.coef_)
selector4 = selector4.fit(X, y)

But of course pipe wasn't fitted yet, so I get another error: AttributeError: 'PLSRegression' object has no attribute '_coef_'.

  5. OK, maybe I need to fit pipe before using it with the selector?
pipe = Pipeline([('pls', PLSRegression(n_components=3))])
pipe.fit(X, y)
selector5 = RFECV(pipe, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=-2, 
                 min_features_to_select=(2*np.shape(X)[0]),
                 scoring='neg_root_mean_squared_error',
                 importance_getter=pipe.named_steps.pls.coef_)
selector5 = selector5.fit(X, y)

Nope, new error:

ValueError: `importance_getter` has to be a string or `callable`
  6. So maybe make importance_getter into a function?
def importance_getter(pipe):
    return pipe.named_steps.pls.coef_

pipe = Pipeline([('pls', PLSRegression(n_components=3))])

pipe.fit(X, y)

selector6 = RFECV(pipe, 
                 step=1, 
                 cv=5, 
                 verbose=0, 
                 n_jobs=-2, 
                 min_features_to_select=(2*np.shape(X)[0]),
                 scoring='neg_root_mean_squared_error',
                 importance_getter=importance_getter(pipe))

selector6 = selector6.fit(X, y)

Which returns the same error as (5). Long story short, I need help figuring out how to do this properly!

Thanks

danronmoon
sean412
Update: I read the docs again and noticed that `named_steps.pls.coef_` should actually be passed to `importance_getter` as a string, so I tried that but got the same results as (3):
```
pipe = Pipeline([('pls', PLSRegression(n_components=3))])
pipe.fit(X, y)
selector7 = RFECV(pipe,
                  step=1,
                  cv=5,
                  verbose=0,
                  n_jobs=2,
                  min_features_to_select=(2*np.shape(X)[0]),
                  scoring='neg_root_mean_squared_error',
                  importance_getter='named_steps.pls.coef_')
selector7 = selector7.fit(X, y)
fig = plt.figure()
plt.plot(selector7.ranking_)
plt.show()
```
– sean412 Apr 25 '23 at 23:26

0 Answers