from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

I have X and Y data.

data = load_iris()    
X = data.data
Y = data.target 

I would like to implement RFECV feature selection and prediction with a k-fold cross-validation approach.

Code adapted from the answer by Vivek Kumar (https://stackoverflow.com/users/3374996/vivek-kumar):

clf = RandomForestClassifier()

kf = KFold(n_splits=2, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

EDIT (for the small remaining part):

X_new = rfecv.transform(X)
print(X_new.shape)

y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)

2 Answers


Instead of wrapping StandardScaler and RFECV in the same pipeline, build a pipeline from StandardScaler and RandomForestClassifier and pass that pipeline to RFECV as the estimator. This way no training information is leaked: RFECV clones the pipeline and refits it on each internal training fold, so the scaler never sees the corresponding held-out fold.

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

pipeline = Pipeline(estimators)


rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update: About the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes'

Yes, that's a known issue with the scikit-learn Pipeline. You can look at my other answer here for more details and use the custom pipeline I created there.

Define a custom pipeline like this:

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

And use that:

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update 2:

@brute, for your data and code, the algorithm completes within a minute on my PC. This is the complete code I use:

import numpy as np
import glob
from sklearn.utils import resample
files = glob.glob('/home/Downloads/Untitled Folder/*') 
outs = [] 
for fi in files: 
    data = np.genfromtxt(fi, delimiter='|', dtype=float) 
    data = data[~np.isnan(data).any(axis=1)] 
    data = resample(data, replace=False, n_samples=1800, random_state=0) 
    outs.append(data) 

X = np.vstack(outs) 
print(X.shape)
Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800)
print(Y.shape)

#from sklearn.utils import shuffle
#X, Y = shuffle(X, Y, random_state=0)

from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier()

kf = KFold(n_splits=10, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

Update 3: For cross_val_predict

X_new = rfecv.transform(X)
print(X_new.shape)

# Here, pass pipeline instead of clf,
# because RFECV found the features based on scaled data,
# and the bare clf would not see scaled data
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
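Note that cross_val_predict clones and refits the pipeline on each of the kf folds, so the scaler statistics are computed from each fold's training portion only.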
Vivek Kumar

Here's how we'll do it:

Fit on the training set

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()    
X = data.data
Y = data.target

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, shuffle=True)

# create model
clf = RandomForestClassifier()    
# instantiate K-Fold
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# pipeline estimators
estimators = [('standardize' , StandardScaler()),
             ('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy'))]

# instantiate pipeline
pipeline = Pipeline(estimators)    
# fit rfecv to train model
rfecv_model = pipeline.fit(X_train, y_train)

# print number of selected features
print ('no. of selected features =', pipeline.named_steps['rfecv'].n_features_)
# print feature ranking
print ('ranking =', pipeline.named_steps['rfecv'].ranking_)

'Output':
no. of selected features = 3
ranking = [1 2 1 1]
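(In RFECV's ranking_, a value of 1 marks a selected feature; larger values indicate features eliminated in earlier rounds.)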

Predict on the test set

# make predictions on the test set
predictions = rfecv_model.predict(X_test)

# evaluate the model performance using accuracy metric
print("Accuracy on test set: ", accuracy_score(y_test, predictions))

'Output':
Accuracy on test set:  0.9736842105263158
Ekaba Bisong
  • @brute No. This code does not use StandardScaler anywhere. You just define it inside the pipeline, but it's never used (never fitted). When you do `pipeline.named_steps['rfecv'].fit(X_train, y_train)` you are directly using the RFECV on the original data, not the scaled data. – Vivek Kumar Jul 19 '18 at 08:54
  • @brute. Code updated. This uses the pipeline properly to scale and fit with RFE. – Ekaba Bisong Jul 19 '18 at 09:35
  • @VivekKumar. Please define leaking data. You're wrong. I really don't care what `rfecv` does with the training data `x_train`. The important thing here is that we first split the dataset into a training and a testing set using the `train_test_split` method. We fit the `rfecv` method on the `train` set and predict on the `test` set. **No data from the `test` set is leaked into the `train` set**. Do not confuse the OP. – Ekaba Bisong Jul 20 '18 at 05:08
  • That's the problem. RFECV will again split X_train into train and test folds internally (using the cv folds). In your code the data is scaled before that internal split, so RFECV's training data, and hence the model, already knows about the internal test data, because the scaling statistics were computed using it (here I am talking about the internal train and test sets). The set of features you then find important will be biased. – Vivek Kumar Jul 20 '18 at 14:31
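To make the point in this last comment concrete, here is a minimal, hypothetical sketch (reusing the iris data and the Mypipeline class from the accepted answer; all variable names are illustrative) contrasting the leaky and leak-free setups. With a tree-based model on iris the scores will barely differ, but the structural difference is what the comment describes:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Mypipeline(Pipeline):
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X, Y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the data is scaled once, on all of X, before RFECV's internal
# cross-validation splits it, so every internal training fold is scaled
# with statistics that also come from its held-out fold.
X_scaled = StandardScaler().fit_transform(X)
leaky = RFECV(estimator=RandomForestClassifier(random_state=0),
              cv=kf, scoring='accuracy').fit(X_scaled, Y)

# Leak-free: the scaler sits inside the estimator that RFECV clones and
# refits, so it is fitted on each internal training fold only.
pipeline = Mypipeline([('standardize', StandardScaler()),
                       ('clf', RandomForestClassifier(random_state=0))])
clean = RFECV(estimator=pipeline, cv=kf, scoring='accuracy').fit(X, Y)

print('leaky n_features_ =', leaky.n_features_,
      '| leak-free n_features_ =', clean.n_features_)

Nothing in this sketch changes either answer's code; it only shows where the scaler has to live for RFECV's internal cross-validation to stay honest.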