Is there a scikit-learn preprocessor I can use or implement to select a subset of rows from a pandas DataFrame? I would prefer a preprocessor, since I want to use it as a step in a pipeline.

user308827
  • This can be done with pandas. https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe. Are you selecting rows based on some criteria? – tdpu Mar 14 '22 at 17:06
  • Yes, I do have a criterion, and I want to use a preprocessor rather than pandas – user308827 Mar 14 '22 at 17:26
  • According to this: https://stackoverflow.com/questions/25539311/custom-transformer-for-sklearn-pipeline-that-alters-both-x-and-y you should probably do it outside of `sklearn`. Can you just make a custom function to drop offending rows and insert it into your pipeline? Without some more context or example code, I don't think I can offer anything more helpful. – tdpu Mar 14 '22 at 17:52
  • As mentioned in an answer at the linked question above, the `imblearn` package extends the `sklearn` pipeline to accommodate changing the number of rows. – Ben Reiniger Mar 14 '22 at 18:12

1 Answer

You can use a FunctionTransformer for that. A FunctionTransformer accepts any callable that exposes the same interface as a standard scikit-learn transform call. In code:

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

class RowSelector:
    def __init__(self, rows:list[int]):
        self._rows = rows

    def __call__(self, X:pd.DataFrame, y=None) -> pd.DataFrame:
        return X.iloc[self._rows,:]

selector = FunctionTransformer(RowSelector(rows=[1,3]))
df = pd.DataFrame({'a':range(4), 'b':range(4), 'c':range(4)})
selector.fit_transform(df)
# Returns:
#    a  b  c
# 1  1  1  1
# 3  3  3  3

Note that I have used a callable object to track some state, i.e. the rows to be selected. This is not necessary and could be solved differently.
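One way to solve it differently, as a minimal sketch: use a plain function and hand the state to FunctionTransformer through its `kw_args` parameter. The function name `select_rows` and its `rows` argument are made up for illustration; only `FunctionTransformer` and `kw_args` come from scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def select_rows(X: pd.DataFrame, rows) -> pd.DataFrame:
    # Hypothetical helper: positional row selection, like RowSelector above.
    return X.iloc[rows, :]


# kw_args is forwarded to select_rows as keyword arguments on each transform call.
selector = FunctionTransformer(select_rows, kw_args={"rows": [1, 3]})
df = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})
result = selector.fit_transform(df)
```

Whether the class or the function reads better is mostly a matter of taste; the function version keeps the selected rows visible as a constructor argument of the transformer.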

The cool thing is that it returns a DataFrame, so if you use it as the first step of your pipeline, you can also combine it with a subsequent column transformer (if needed, of course).
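A rough sketch of such a pipeline, assuming a hypothetical `select_rows` helper and an arbitrary choice of `StandardScaler` on columns `a` and `b` (neither is part of the original answer):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_rows(X: pd.DataFrame, rows) -> pd.DataFrame:
    # Hypothetical helper: keep only the given row positions.
    return X.iloc[rows, :]


pipeline = Pipeline([
    # Step 1: row selection; the output is still a DataFrame with column names.
    ("select_rows", FunctionTransformer(select_rows, kw_args={"rows": [1, 3]})),
    # Step 2: the column transformer can therefore address columns by name.
    ("columns", ColumnTransformer([("scale", StandardScaler(), ["a", "b"])])),
])

df = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})
out = pipeline.fit_transform(df)  # NumPy array: 2 selected rows, 2 scaled columns
```

One caveat: a scikit-learn pipeline only passes X through its transformers, so if you also pass a target `y`, it keeps all of its rows and no longer lines up with the filtered X. For supervised pipelines, the `imblearn` route mentioned in the comments is probably the safer option.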

Simon Hawe