
Environment:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

Sample data:

X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'], 
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1,0,1]})

Desired outcome: I would like to include sklearn OneHotEncoder in my pipeline in this format:

encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)

# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)

Challenge: OneHotEncoder encodes everything, including the numerical columns. I want to keep the numerical columns as they are and encode only the categorical features, in an efficient way that is compatible with Pipeline().

encoder = OneHotEncoder(drop='first', sparse=False)  # sparse= was renamed sparse_output= in sklearn 1.2
encoder.fit(X_train)
encoder.transform(X_train)  # Column C is encoded too - this is what I want to avoid

Workaround (not ideal): I can get around the problem using pd.get_dummies(). However, that means I can't include it in my pipeline. Or is there a way?

X_train = pd.get_dummies(X_train, drop_first=True)
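For reference, a sketch of one way to wrap it (this does run inside a Pipeline, but pd.get_dummies is stateless, so unseen categories at prediction time can produce mismatched dummy columns, which is why I don't consider it ideal):

from sklearn.preprocessing import FunctionTransformer

# Stateless wrapper: works at fit time, but learns nothing about the training
# categories, so transform on new data may yield different columns.
encoder = FunctionTransformer(lambda df: pd.get_dummies(df, drop_first=True))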

2 Answers


My preferred solution for this would be to use sklearn's ColumnTransformer (see here).

It enables you to split the data into as many groups as you want (in your case, categorical vs. numerical data) and apply different preprocessing operations to each group. The transformer can then be used in a pipeline like any other sklearn preprocessing tool. Here is a short example:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({"a":[1,2,3],"b":["A","A","B"]})
y = np.array([0,1,1])

OHE = OneHotEncoder()
scaler = StandardScaler()
RFC = RandomForestClassifier()

cat_cols = ["b"]
num_cols = ["a"]

transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
                                ('num_cols', scaler, num_cols)])

pipe = Pipeline([("preprocessing", transformer),
                ("classifier", RFC)])
pipe.fit(X,y)

NB: I have taken some license with your request, because this only applies the scaler to the numerical data, which I believe makes more sense. If you do want to apply the scaler to all columns, you can do that as well by modifying this example, as sketched below.
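A minimal sketch of that variant (reusing X, y, cat_cols and RFC from the example above; note the encoder must output a dense array, because StandardScaler cannot mean-centre sparse input):

# Encode the categorical columns, pass the rest through untouched,
# then standardise the combined matrix (dummies included).
dense_OHE = OneHotEncoder(sparse=False)  # sparse_output=False on sklearn >= 1.2

transformer = ColumnTransformer([('cat_cols', dense_OHE, cat_cols)],
                                remainder='passthrough')

pipe = Pipeline([("preprocessing", transformer),
                 ("scaling", StandardScaler()),
                 ("classifier", RFC)])
pipe.fit(X, y)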

MaximeKan
  • Thank you for introducing me to ColumnTransformer. Great point on the scaler. I agree if I were using MinMaxScaler(). However, with StandardScaler(), one hot encoded dummies would still be transformed so that their mean is centred around 0. What are your thoughts on standardising dummies along with numerical features if I am using a modelling technique that is more sensitive to feature scale, say logistic regression? – Zolzaya Luvsandorj Mar 22 '20 at 23:51
  • If I still wanted to standardise everything, is this the most efficient way to tweak your code: `transformer = ColumnTransformer([('cat_cols', OHE, cat_cols)], remainder = 'passthrough')` then `pipe = Pipeline([("preprocessing", transformer), ("scaling", scaler), ("classifier", RFC)])` – Zolzaya Luvsandorj Mar 22 '20 at 23:54
  • MinMaxScaler will not impact your one hot encoded columns. StandardScaler on the other hand does. I still think it does not make intuitive sense to scale these, but you can still try and see how it impacts your classification scores. And to your second question, yes, this is how you would tweak my example to apply the scaler to all features. – MaximeKan Mar 23 '20 at 08:55

What I would do is create my own custom transformer and put it into the pipeline. That way, you have a lot of control over the data. The steps are as follows:

1) Create a custom transformer class inheriting from BaseEstimator and TransformerMixin. In its transform() function, try to detect whether each column's values are numerical or categorical. If you do not want to deal with that logic right now, you can always pass the categorical column names to your transform() function and select them on the fly.

2) (Optional) Create your custom transformer to handle columns with only categorical values.

3) (Optional) Create your custom transformer to handle columns with only numerical values.

4) Build two pipelines (one for the categorical columns, the other for the numerical ones) using the transformers you created; you can also mix in existing ones from sklearn.

5) Merge two pipelines with FeatureUnion.

6) Merge your big pipeline with your ML model.

7) Call fit_transform()

The sample code (no optionals implemented): GitHub Jupyter Notebook. A condensed sketch of the same idea follows below.
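A minimal, self-contained sketch of steps 1) to 7) on the question's sample data (the ColumnSelector transformer here is a hypothetical illustration, not the notebook's code):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Step 1: select DataFrame columns by dtype (categorical vs numerical)."""
    def __init__(self, dtype):
        self.dtype = dtype  # 'object' for categorical, 'number' for numerical

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(include=self.dtype)

# Steps 4 and 5: one pipeline per column group, merged with FeatureUnion
cat_pipe = Pipeline([('select_cat', ColumnSelector('object')),
                     ('encode', OneHotEncoder(drop='first'))])
num_pipe = Pipeline([('select_num', ColumnSelector('number')),
                     ('scale', StandardScaler())])
preprocessing = FeatureUnion([('categorical', cat_pipe),
                              ('numerical', num_pipe)])

# Step 6: merge the preprocessing with the model
pipe = Pipeline([('preprocessing', preprocessing),
                 ('classifier', RandomForestClassifier(random_state=0))])

# Step 7: fit on the question's sample data
X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
pipe.fit(X_train, [1, 0, 1])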

Seleme
  • Thanks Seleme, would it be possible to include sample code using the sample data I provided to illustrate what you mean? – Zolzaya Luvsandorj Mar 22 '20 at 06:51
  • @ZolzayaLuvsandorj I added a jupyter notebook link. There, you can see the transformed data set. Observe that `A` and `B` columns are `OneHotEncode`'d whereas `C` column is `StandardScale`'d – Seleme Mar 22 '20 at 08:03
  • @ZolzayaLuvsandorj You have to add distinctive `dtype`s that are defined in `numpy` which may be present in your DataFrame. The dict `_supported_dtypes` is responsible for mapping categorical and numerical dtypes. – Seleme Mar 22 '20 at 08:05
  • @ZolzayaLuvsandorj I forgot to add it in the code, but you can of course fit a model on the transformed `X_train`. – Seleme Mar 22 '20 at 08:08
  • Thanks @Seleme for your contribution. I see your recommended pipeline for one hot encoding, but it looks quite complex. Keeping in mind my desired outcome, how do I incorporate the FeatureUnion into my whole pipeline? Because I want one pipeline that includes one hot encoding for categorical variables, the scaler and the classifier. – Zolzaya Luvsandorj Mar 22 '20 at 09:07
  • @ZolzayaLuvsandorj Ok, in the `encoder = ### SOME CODE ###` part, you should write your own custom one hot encoder that inherits from the original OneHotEncoder, but during its `transform` call, select and transform only the categorical columns, just like I did in my code, and return the other columns as they are. – Seleme Mar 22 '20 at 09:23
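That suggestion might look something like the hypothetical PartialOneHotEncoder below (a sketch, not from the notebook), which then slots into the question's ideal pipeline as encoder = PartialOneHotEncoder(drop='first'):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

class PartialOneHotEncoder(OneHotEncoder):
    """Hypothetical sketch: one hot encode only the object-dtype columns
    and return the numerical columns unchanged."""

    def fit(self, X, y=None):
        # Remember which columns are categorical and which are numerical
        self.cat_cols_ = list(X.select_dtypes(include='object').columns)
        self.num_cols_ = [c for c in X.columns if c not in self.cat_cols_]
        return super().fit(X[self.cat_cols_], y)

    def transform(self, X):
        encoded = super().transform(X[self.cat_cols_])
        if hasattr(encoded, 'toarray'):  # densify if the parent emitted sparse
            encoded = encoded.toarray()
        # Append the untouched numerical columns to the encoded dummies
        return np.hstack([encoded, X[self.num_cols_].to_numpy()])

Note that the question's Scaler step would then standardise the dummies along with the numerical features, which ties back to the discussion under the other answer.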