26

I'm trying to convert a categorical value (in my case the Country column) into an encoded value using LabelEncoder followed by OneHotEncoder, and I was able to convert the categorical values. But I'm getting a warning that the OneHotEncoder 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?

Below are my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name

label_encoder = LabelEncoder()
# LabelEncoder is used to encode the Country values as integers
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])
hot_encoder = OneHotEncoder(categorical_features=[0])
x = hot_encoder.fit_transform(x).toarray()

And the output I'm getting is shown below. How can I get the same output with ColumnTransformer?

0 (France)  1 (Germany)  2 (Spain)  3 (Age)  4 (Salary)
1           0            0          44       72000
0           0            1          27       48000
0           1            0          30       54000
0           0            1          38       61000
0           1            0          40       67000
1           0            0          35       58000
0           0            1          26       52000
1           0            0          48       79000
0           1            0          50       83000
1           0            0          37       67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(), [0])
)
x = preprocess.fit_transform(x).toarray()

I was able to encode the Country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.

chinna g
  • transformer = ColumnTransformer(transformers=[("Country", OneHotEncoder(), [0])], remainder='passthrough'); X = transformer.fit_transform(X) ("Country" is just a name, OneHotEncoder() is the transformer class, and [0] is the column(s) to apply it to). – Swarit Agarwal Sep 04 '19 at 10:41
  • Some issues/suggestions in your code/approach: 1. You don't need a LabelEncoder (ideally, it's for the response variable). Refer: https://stackoverflow.com/a/63822728/5114585 2. You can directly use OneHotEncoder [To be continued..] – Dr Nisha Arora Sep 16 '20 at 03:01
  • 3. For this data, you can also directly pick the categorical column, but to automate applying OHE to all categorical columns you can use ColumnTransformer() or make_column_transformer [they are slightly different: ColumnTransformer requires naming the steps, make_column_transformer does not]. 4. Selecting categorical variables for the column transformer can be done in various ways, such as by column names, index, data type, etc. [refer to the sklearn documentation to know more] – Dr Nisha Arora Sep 16 '20 at 03:01

9 Answers

31

It is a bit strange to encode continuous data such as Salary. It makes no sense unless you have binned your salary into ranges/categories. If I were you, I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can pipe it to a classifier, e.g.:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like so:

clf.fit(X_train,y_train)

This will apply the preprocessor and then pass the transformed data to the predictor.
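For instance, here is a minimal sketch using the question's sample data; the target y is hypothetical, invented only to make the fit call runnable:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical labels, not in the original post

clf.fit(df, y)          # preprocesses the columns, then fits the classifier
print(clf.predict(df))  # the same preprocessing is applied before predicting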

Update:

If we want to select columns by data type on the fly, we can modify our preprocessor to use make_column_selector:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
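Note that dtype_include="category" matches only columns already stored with pandas' category dtype; plain string columns have dtype object. A small sketch of both options, assuming the data lives in a DataFrame named df:

# Convert the string column to the category dtype so the selector finds it
df['Country'] = df['Country'].astype('category')

# Alternatively, select object (string) columns directly
categorical_selector = selector(dtype_include=object)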

Using GridSearch

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
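After the search finishes, the best score and parameter combination can be printed (this also addresses the comment below about GridSearchCV):

print(grid_search.best_score_)   # best mean cross-validated score
print(grid_search.best_params_)  # parameter combination that achieved it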

Getting names of features


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False,  # added this line
)

# now we can access feature names with

clf[:-1].get_feature_names_out()  # every step before the final estimator
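Here clf[:-1] slices the pipeline to every step before the final estimator, i.e. just the preprocessor. An equivalent sketch using the step's name:

clf.named_steps['preprocessor'].get_feature_names_out()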

Prayson W. Daniel
  • How do you combine the pipeline with GridSearchCV and still print the best score and best params? – hudarsono May 20 '19 at 02:45
  • How does sklearn know which column is numerical and which is categorical? – Pogger Apr 03 '21 at 04:37
  • We have to pass them ourselves, `numeric_features = ['Salary']` etc. We can also use `from sklearn.compose import make_column_selector` and select only numeric columns; see https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py – Prayson W. Daniel Apr 03 '21 at 04:48
  • Hi @PraysonW.Daniel, I found your answer to this question very useful for my current problem. May I ask you to have a look, if you do not mind? https://stackoverflow.com/questions/67493509/pre-processing-resampling-and-pipelines (maybe you can help me figure out what I am doing wrong). Thanks a lot – Math May 12 '21 at 11:47
13

I think the poster is not trying to transform the Age and Salary columns. Per the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) transforms only the columns specified in the transformers (i.e., [0] in your example). Set remainder="passthrough" to keep the rest of the columns. In other words:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer((OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)
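One caveat worth adding (an editorial note, not from the original answer): depending on the data, ColumnTransformer may return a SciPy sparse matrix, so the asker's trailing .toarray() can still be needed. A guarded sketch:

from scipy import sparse

if sparse.issparse(x):  # densify only if the result came back sparse
    x = x.toarray()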
passerby
  • How do you tackle this warning? FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning) – Fawwaz Yusran Jun 18 '19 at 02:35
  • @FawwazYusran just comment out the lines containing the LabelEncoder and directly use passerby's suggestion. – Sandipan Majhi Aug 31 '19 at 20:34
5

The simplest method is to use pandas get_dummies on your CSV DataFrame:

import pandas as pd

dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset, columns=['Country'])

Finished! Your dataset will now contain one indicator column per country.
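The result should look roughly like this (a sketch; the original answer linked an image, and newer pandas versions may show the dummies as True/False instead of 1/0):

print(dataset.head(3))
#    Age  Salary  Country_France  Country_Germany  Country_Spain
# 0   44   72000               1                0              0
# 1   27   48000               0                0              1
# 2   30   54000               0                1              0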

Shiva_Adasule
  • I have been working with sklearn for over a year and I did not know this. Thanks a lot! – Dominik Novotný Feb 04 '20 at 06:21
  • It might be easier to use pandas for preprocessing data; however, using sklearn for the same has advantages, as the preprocessing steps can be used in pipelines and later cross-validated along with model performance. – Dr Nisha Arora Sep 16 '20 at 02:49
  • @Shiva_Adasule I tried this but got the same error as before the conversion: ValueError: could not convert string to float: 'Honda' – Web Development Labs Oct 09 '20 at 21:41
2

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]),
    remainder="passthrough")
X = preprocess.fit_transform(X)

I fixed the same issue using the above code.

Arvind Chavhan (answer edited by Nicolás Ozimica)
1

@FawwazYusran, to tackle this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

Remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Since you are using OneHotEncoder directly, you don't need LabelEncoder.

Yung
1

You can use OneHotEncoder directly; you don't need LabelEncoder:

# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(
    transformers=[
        ("OneHotEncoder",  # just a name
         OneHotEncoder(),  # the transformer class
         [0]               # the column(s) to apply it to (here, the Country column)
         )
    ],
    remainder='passthrough'
)
X = transformer.fit_transform(X.tolist())
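A side note (not from the original answer): the .tolist() conversion should not be required, since ColumnTransformer also accepts a NumPy array or DataFrame directly:

X = transformer.fit_transform(X)  # passing the array directly also works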
Suresh Mangs
0

Since you are transforming only the Country column (i.e., [0] in your example), use remainder="passthrough" so the remaining columns are kept as they are.

Try:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
preprocess = ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                               [0])], remainder="passthrough")
x = np.array(preprocess.fit_transform(x), dtype=int)  # np.int is deprecated; use int
Lekha (answer edited by Reegan Miranda)
0

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0])], remainder='passthrough')
x = onehotencorder.fit_transform(x).toarray()

The great advantage of OneHotEncoder is that it can convert several columns at once; see the example passing several columns:

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1,3,5,6,7,8,9,13])],remainder='passthrough')
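Keep in mind (an editorial note, not from the original answer) that the one-hot blocks for the listed columns come first in the output, followed by the passthrough columns, so the original column positions shift. In newer scikit-learn versions (1.0+), the resulting column order can be inspected:

onehotencorder.fit(x)
print(onehotencorder.get_feature_names_out())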

If it's a single column, you can do it the traditional way:

from sklearn.preprocessing import LabelEncoder
labelencoder_predictors = LabelEncoder()
x[:,0] = labelencoder_predictors.fit_transform(x[:,0])

Another suggestion: do not use variable names like x, y, z. Name variables after what they represent, for example predictors, classes, countries, etc.

Dorathoto
0

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])
# Country is column 0, so pass [0] to the transformer
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
# onehotencoder = OneHotEncoder(categorical_features=[0])  # old, deprecated API
X = ct.fit_transform(X).toarray()