Is there a scikit-learn preprocessor I can use or implement to select a subset of rows from a pandas DataFrame? I would prefer a preprocessor because I want to build a pipeline with this as a step.
Viewed 688 times
5
-
This can be done with pandas. https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe. Are you selecting rows based on some criteria? – tdpu Mar 14 '22 at 17:06
-
Yes, I do have a criterion, and I want to use a preprocessor rather than pandas – user308827 Mar 14 '22 at 17:26
-
According to this: https://stackoverflow.com/questions/25539311/custom-transformer-for-sklearn-pipeline-that-alters-both-x-and-y you should probably do it outside of `sklearn`. Can you just make a custom function to drop offending rows and insert it into your pipeline? Without some more context or example code, I don't think I can offer anything more helpful. – tdpu Mar 14 '22 at 17:52
-
As mentioned in an answer at the linked question above, the `imblearn` package extends the `sklearn` pipeline to accommodate changing the number of rows. – Ben Reiniger Mar 14 '22 at 18:12
1 Answer
5
You can use a `FunctionTransformer` to do that. You can pass it any callable that takes the input `X` and returns the transformed data, which is the same contract a standard scikit-learn `transform` call follows. In code:
    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    class RowSelector:
        def __init__(self, rows: list[int]):
            self._rows = rows

        def __call__(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
            return X.iloc[self._rows, :]

    selector = FunctionTransformer(RowSelector(rows=[1, 3]))
    df = pd.DataFrame({'a': range(4), 'b': range(4), 'c': range(4)})
    selector.fit_transform(df)
    # Returns
    #    a  b  c
    # 1  1  1  1
    # 3  3  3  3
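Note that I used a callable object to hold some state, i.e. the rows to be selected. This is not strictly necessary; the same state could be supplied differently, for example through a closure or the `kw_args` parameter of `FunctionTransformer`.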
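As a minimal sketch of one such alternative, a plain function can receive the row positions through the `kw_args` parameter of `FunctionTransformer` instead of storing them on an object. The function name `select_rows` and its `rows` argument are hypothetical names chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def select_rows(X: pd.DataFrame, rows) -> pd.DataFrame:
    # Positional row selection; `rows` arrives via kw_args below.
    return X.iloc[rows, :]


# kw_args is forwarded to select_rows as keyword arguments on every transform call.
kw_selector = FunctionTransformer(select_rows, kw_args={"rows": [1, 3]})

df = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})
kw_result = kw_selector.fit_transform(df)
print(kw_result)
```

This keeps the selection logic stateless, at the cost of fixing the rows when the transformer is constructed, just as the callable object does.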
A nice property is that it returns a DataFrame, so if it is the first step of your pipeline, you can combine it with a subsequent `ColumnTransformer` that refers to columns by name (if needed, of course).
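A rough sketch of such a pipeline could look like the following. The choice of `StandardScaler` on column `a`, the step names, and the `select_rows` helper are assumptions made for illustration only. Keep in mind that only `X` is filtered here; as the comments on the question point out, a plain scikit-learn pipeline does not drop the matching rows of `y`, so this fits unsupervised preprocessing best.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_rows(X: pd.DataFrame, rows) -> pd.DataFrame:
    # Hypothetical helper: keep only the given row positions.
    return X.iloc[rows, :]


pipe = Pipeline([
    # Step 1: row selection; the output is still a DataFrame with column names.
    ("select_rows", FunctionTransformer(select_rows, kw_args={"rows": [1, 3]})),
    # Step 2: scale column "a" by name and pass the other columns through unchanged.
    ("columns", ColumnTransformer(
        [("scale_a", StandardScaler(), ["a"])],
        remainder="passthrough",
    )),
])

df = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})
out = pipe.fit_transform(df)  # NumPy array: scaled "a" first, then "b" and "c"
print(out)
```

Because the scaler is fitted after the row selection, its statistics come from the two selected rows only.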

Simon Hawe