
I'm trying to implement my own Imputer. Under certain conditions, I would like to filter some of the train samples (that I deem low quality).

However, I can't seem to find a way to do it: the transform method returns only X and not y; y itself is a NumPy array (which, to the best of my knowledge, I can't filter in place); and moreover, when I use GridSearchCV, the y my transform method receives is None.

Just to clarify: I'm perfectly clear on how to filter arrays. What I can't find is a way to fit sample filtering of the y vector into the current API.

I really want to do this from a BaseEstimator implementation so that I can use it with GridSearchCV (it has a few parameters). Am I missing a different way to achieve sample filtering (not through BaseEstimator, but still GridSearchCV-compliant)? Is there some way around the current API?
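To illustrate the API limitation being described: a standard Pipeline calls transform with X only, so a transformer that drops rows leaves X and y misaligned. A minimal sketch (RowFilter and its quality criterion are hypothetical, not part of any library):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RowFilter(BaseEstimator, TransformerMixin):
    """Toy transformer that tries to drop 'low quality' samples."""
    def fit(self, X, y=None):
        self.keep_ = X.sum(axis=1) > 0  # hypothetical quality criterion
        return self

    def transform(self, X):
        # Returns fewer rows than it received: X shrinks, but the
        # caller still holds the original, full-length y.
        return X[self.keep_]

X = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]])
y = np.array([0, 1, 0])
Xt = RowFilter().fit(X, y).transform(X)
print(Xt.shape[0], len(y))  # 2 vs 3 -- X and y are now misaligned
```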

Korem

2 Answers


I have found a solution, which has three parts:

  1. Keep a reference to the training X in fit and add the if idx == id(self.X): check in transform. This ensures samples are filtered only on the training set.
  2. Override fit_transform so that transform receives y rather than None.
  3. Override the Pipeline to allow transform to return said y.

Here's sample code demonstrating it. It may not cover every tiny detail, but I think it solves the major issue, which is with the API.

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.externals import six

class SampleAndFeatureFilter(BaseEstimator, TransformerMixin):
    def __init__(self, perc = None):
        self.perc = perc

    def fit(self, X, y=None):
        self.X = X
        sum_per_feature = X.sum(0)
        sum_per_sample = X.sum(1)
        self.featurefilter = sum_per_feature >= np.percentile(sum_per_feature, self.perc)
        self.samplefilter  = sum_per_sample >= np.percentile(sum_per_sample, self.perc)
        return self

    def transform(self, X, y=None, copy=None):
        idx = id(X)
        X = X[:, self.featurefilter]
        # Only the training set (the exact array seen in fit) gets its
        # samples filtered; any other array keeps all of its rows.
        if idx == id(self.X):
            X = X[self.samplefilter, :]
            if y is not None:
                y = y[self.samplefilter]
            return X, y
        return X

    def fit_transform(self, X, y=None, **fit_params):
        if y is None:
            return self.fit(X, **fit_params).transform(X)
        else:
            return self.fit(X, y, **fit_params).transform(X,y)

class PipelineWithSampleFiltering(Pipeline):
    def fit_transform(self, X, y=None, **fit_params):
        Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
        if hasattr(self.steps[-1][-1], 'fit_transform'):
            return self.steps[-1][-1].fit_transform(Xt, yt, **fit_params)
        else:
            return self.steps[-1][-1].fit(Xt, yt, **fit_params).transform(Xt)

    def fit(self, X, y=None, **fit_params):
        Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
        self.steps[-1][-1].fit(Xt, yt, **fit_params)
        return self

    def _pre_transform(self, X, y=None, **fit_params):
        fit_params_steps = dict((step, {}) for step, _ in self.steps)
        for pname, pval in six.iteritems(fit_params):
            step, param = pname.split('__', 1)
            fit_params_steps[step][param] = pval
        Xt = X
        yt = y
        for name, transform in self.steps[:-1]:
            if hasattr(transform, "fit_transform"):
                res = transform.fit_transform(Xt, yt, **fit_params_steps[name])
                if len(res) == 2:
                    Xt, yt = res
                else:
                    Xt = res
            else:
                Xt = transform.fit(Xt, yt, **fit_params_steps[name]) \
                              .transform(Xt)
        return Xt, yt, fit_params_steps[self.steps[-1][0]]

if __name__ == '__main__':
    X = np.random.random((100,30))
    y = np.random.randint(0, 2, 100)
    pipe = PipelineWithSampleFiltering([('flt', SampleAndFeatureFilter()), ('cls', GaussianNB())])
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.3, random_state = 42)
    kfold = cross_validation.KFold(len(y_train), 10)
    clf = GridSearchCV(pipe, cv = kfold, param_grid = {'flt__perc':[10,20,30,40,50,60,70,80]}, n_jobs = 1)
    clf.fit(X_train, y_train)
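The id()-based train/test discrimination used above can be checked in isolation; only the exact array stored in fit gets rows dropped, while any other array passes through untouched. A standalone sketch (IdCheckFilter is a hypothetical toy, not scikit-learn API):

```python
import numpy as np

class IdCheckFilter:
    def fit(self, X):
        self.X = X  # remember the training array's identity
        self.samplefilter = X.sum(axis=1) >= np.median(X.sum(axis=1))
        return self

    def transform(self, X):
        if id(X) == id(self.X):   # training array: drop samples
            return X[self.samplefilter]
        return X                  # any other array: keep all rows

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(10, 3), rng.rand(4, 3)
f = IdCheckFilter().fit(X_train)
print(f.transform(X_train).shape[0])  # 5 (rows below the median dropped)
print(f.transform(X_test).shape[0])   # 4 (untouched)
```

Note that a copy of the training array (same values, different id) would not be filtered, which is the fragility this trick accepts in exchange for API compliance.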
Korem
  • That really did the job. The issue now is that when GridSearchCV calls the score method on each CV fold, the shapes do not match. How could I adapt this trick to make the score method work? – Angelo May 08 '20 at 07:46

The scikit-learn transformer API is made for changing the features of the data (in nature and possibly in number/dimension), but not for changing the number of samples. Any transformer that drops or adds samples is, as of the existing versions of scikit-learn, not compliant with the API (possibly a future addition if deemed important).

So in view of this, it looks like you will have to work your way around the standard scikit-learn API.
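The simplest workaround this implies is to filter samples outside the pipeline, so X and y stay aligned by construction. A sketch (filter_samples is a hypothetical helper, not scikit-learn API); the trade-off is that the threshold is applied once, before cross-validation, so it cannot be tuned by GridSearchCV:

```python
import numpy as np

def filter_samples(X, y, perc=50):
    """Drop samples whose row sum falls below the perc-th percentile
    (hypothetical quality criterion, mirroring the question)."""
    sums = X.sum(axis=1)
    keep = sums >= np.percentile(sums, perc)
    return X[keep], y[keep]

rng = np.random.RandomState(0)
X = rng.rand(100, 30)
y = rng.randint(0, 2, 100)
X_f, y_f = filter_samples(X, y, perc=20)
print(X_f.shape, y_f.shape)  # (80, 30) (80,) -- still aligned
```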

eickenberg
  • Re-reading your question, I am not entirely sure if it is about this that you are asking. – eickenberg Jul 22 '14 at 20:21
  • Perhaps I wasn't clear, but my only problem is an API problem. I guess I was, in fact, asking for a way around the API / if there is a different API (perhaps another class except for BaseEstimator which is compliant with GridSearchCV) – Korem Jul 22 '14 at 20:59
  • OK, thanks for the clarification. Are you using this transformer within a `sklearn.pipeline.Pipeline`? I don't understand yet how it can be passed a `y=None` from `GridSearchCV`. – eickenberg Jul 22 '14 at 21:25
  • Yes, I am. sklearn runs fit on X and y and then transform only on X, but I guess I can override that implementation. – Korem Jul 22 '14 at 21:35
  • What would be great to have to explore the possibilities is a very simple piece of code that inherits from BaseEstimator and is put into GridSearchCV and outputs something whenever the grid search calls transform. If I have time, I may add this to my answer. At prediction, the transformer will only be provided the X and not the y, since that is what it is trying to predict. At training I would have thought it is provided with both. – eickenberg Jul 23 '14 at 11:37
  • I think I've found a solution. See below. – Korem Jul 23 '14 at 17:28