sklearn pipeline - how to apply different transformations on different columns

Question

I am pretty new to pipelines in sklearn and have run into this problem: my dataset has a mixture of text and numbers, i.e. certain columns contain only text and the rest contain integers (or floating-point numbers).

I was wondering whether it is possible to build a pipeline where I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I have seen on the web mostly apply LabelEncoder() to the entire dataset rather than to selected columns. Is this possible? If so, any pointers would be greatly appreciated.

score 37 · Accepted Answer · edited Mar 02 '18 at 09:23

The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

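You have to define your functions with def, since, annoyingly, you can't use lambda or partial in FunctionTransformer if you want to pickle your model.
You need to initialize FunctionTransformer with validate=False; otherwise the input is converted to a NumPy array before your function sees it, and selecting columns by name fails.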

Something like this:

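from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

# Column selectors are plain named functions so the fitted model can be pickled.
def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

# Note: LabelEncoder only accepts a 1-D target array, so it fails inside a
# pipeline; OrdinalEncoder (scikit-learn >= 0.20) is its counterpart for features.
vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)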
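
The pickling note above can be checked with a short sketch. This is only an illustration, not part of the original answer: `identity` and `can_pickle` are hypothetical helper names, and the exact exception raised for a lambda may differ between Python versions, so several are caught.

```python
import pickle

from sklearn.preprocessing import FunctionTransformer


def identity(df):
    # A named, module-level function: pickle stores it by reference.
    return df


def can_pickle(obj):
    # Hypothetical helper: report whether pickle.dumps succeeds for obj.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


named = FunctionTransformer(identity, validate=False)
anonymous = FunctionTransformer(lambda df: df, validate=False)

print("named function picklable:", can_pickle(named))
print("lambda picklable:", can_pickle(anonymous))
```

If the lambda version reports that it cannot be pickled, the whole pipeline containing it cannot be saved with pickle either, which is why the selectors are written with def.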

edited Mar 02 '18 at 09:23 by guerda
answered Aug 18 '16 at 02:37 by maxymoo

Any idea why I get 'TypeError: All estimators should implement fit and transform.' if I run your code? scikit-learn 0.19.1 – Alessandro Mariani Nov 07 '17 at 11:50
Never mind, the signature has apparently changed. I've submitted an edit. – Alessandro Mariani Nov 07 '17 at 12:09
How could we handle the case where there is one more feature that doesn't need any scaling, along with the above? – sathyz Sep 26 '18 at 13:14

score 20 · Answer 2 · answered Oct 31 '18 at 17:13

Since v0.20, you can use ColumnTransformer to accomplish this.
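
A minimal sketch of what that could look like for the columns mentioned in the question. The toy data and the extra `id` column are made up for illustration, and OrdinalEncoder is used for the text columns on the assumption that integer codes are what is wanted (LabelEncoder is meant for target labels, not feature columns). Setting `remainder="passthrough"` keeps any column not listed, unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy data, invented for this example.
df = pd.DataFrame({
    "name": ["ann", "bob", "cat"],
    "fruit": ["apple", "pear", "apple"],
    "height": [150.0, 180.0, 165.0],
    "age": [20, 40, 30],
    "id": [1, 2, 3],  # a column that should not be transformed
})

ct = ColumnTransformer(
    transformers=[
        # (name, transformer, columns it is applied to)
        ("text", OrdinalEncoder(), ["name", "fruit"]),
        ("num", MinMaxScaler(), ["height", "age"]),
    ],
    remainder="passthrough",  # keep unlisted columns as they are
)

# Output columns follow the transformer order, with passthrough columns last.
X = ct.fit_transform(df)
print(X)
```

The resulting `ct` object is itself a transformer, so it can be used as the first step of a Pipeline in front of an estimator.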

answered Oct 31 '18 at 17:13 by zachguo

Could you please provide an example? – lightbox142 May 01 '20 at 22:55

score 10 · Answer 3 · answered Apr 15 '21 at 13:12

An example of ColumnTransformer might help you:

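# Imports for the scikit-learn parts. COUNTIES_OF_INTEREST, MyW2VTransformer,
# KerasClassifier and create_model are not shown here; see the linked source.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
# (sparse=False was renamed to sparse_output=False in scikit-learn 1.2)
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

# each entry is (name, transformer, columns the transformer is applied to)
featurisation = ColumnTransformer(transformers=[
    ('impute_and_one_hot_encode', impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training (the ColumnTransformer picks only the listed feature columns)
model = pipeline.fit(train_data, train_data['label'])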

You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
