26

I'm trying to convert a categorical value (in my case the Country column) into an encoded value using LabelEncoder followed by OneHotEncoder, and I was able to convert the categorical values. But I'm getting a warning that the OneHotEncoder 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead." So how can I use ColumnTransformer to achieve the same result?

Below are my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name

label_encoder = LabelEncoder()
# LabelEncoder is used to encode the Country values as integers
x.iloc[:, 0] = label_encoder.fit_transform(x.iloc[:, 0])
hot_encoder = OneHotEncoder(categorical_features=[0])
x = hot_encoder.fit_transform(x).toarray()

And the output I'm getting is shown below. How can I get the same output with ColumnTransformer?

0 (France)  1 (Germany)  2 (Spain)  3 (Age)  4 (Salary)
1           0            0          44       72000
0           0            1          27       48000
0           1            0          30       54000
0           0            1          38       61000
0           1            0          40       67000
1           0            0          35       58000
0           0            1          26       52000
1           0            0          48       79000
0           1            0          50       83000
1           0            0          37       67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(), [0])
)
x = preprocess.fit_transform(x).toarray()

I was able to encode the Country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.

chinna g
  • transformer = ColumnTransformer(transformers=[("Country", OneHotEncoder(), [0])], remainder='passthrough'); X = transformer.fit_transform(X) ("Country" is just a name, OneHotEncoder() is the transformer class, and [0] is the column(s) to apply it to). – Swarit Agarwal Sep 04 '19 at 10:41
  • Some issues/suggestions in your code/approach: 1. You don't need a LabelEncoder (ideally, it's for the response variable). Refer: https://stackoverflow.com/a/63822728/5114585 2. You can directly use OneHotEncoder [To be continued..] – Dr Nisha Arora Sep 16 '20 at 03:01
  • 3. For this data, you can also directly pick the categorical column, but to automate applying OHE to all categorical columns you can use ColumnTransformer() or make_column_transformer [they are slightly different: ColumnTransformer requires naming the steps, make_column_transformer does not]. 4. Selecting categorical variables for the column transformer can be done in various ways, such as by column names, index, data type, etc. [refer to the sklearn documentation to know more] – Dr Nisha Arora Sep 16 '20 at 03:01

9 Answers

31

It is a bit strange to encode continuous data such as Salary. It makes no sense unless you have binned your salary into ranges/categories. If I were you, I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can pipe it to a classifier, e.g.:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like so:

clf.fit(X_train,y_train)

This will apply the preprocessor and then pass the transformed data to the predictor.
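For instance, here is a minimal sketch using the question's sample data; the target y is hypothetical, invented only to make the fit call runnable:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical labels, not in the original post

clf.fit(df, y)          # preprocesses the columns, then fits the classifier
print(clf.predict(df))  # the same preprocessing is applied before predicting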

Update:

If we want to select columns by data type on the fly, we can modify our preprocessor to use make_column_selector:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
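Note that dtype_include="category" matches only columns already stored with pandas' category dtype; plain string columns have dtype object. A small sketch of both options, assuming the data lives in a DataFrame named df:

# Convert the string column to the category dtype so the selector finds it
df['Country'] = df['Country'].astype('category')

# Alternatively, select object (string) columns directly
categorical_selector = selector(dtype_include=object)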

Using GridSearch

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],
}

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
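After the search finishes, the best score and parameter combination can be printed (this also addresses the comment below about GridSearchCV):

print(grid_search.best_score_)   # best mean cross-validated score
print(grid_search.best_params_)  # parameter combination that achieved it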

Getting names of features


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include="category"))],
    verbose_feature_names_out=False,  # added this line
)

# now we can access feature names with

clf[:-1].get_feature_names_out()  # every step before the final estimator
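Here clf[:-1] slices the pipeline to every step before the final estimator, i.e. just the preprocessor. An equivalent sketch using the step's name:

clf.named_steps['preprocessor'].get_feature_names_out()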

Prayson W. Daniel
  • How do you combine the pipeline with GridSearchCV and still print the best score and best params? – hudarsono May 20 '19 at 02:45
  • How does sklearn know which column is numerical and which is categorical? – Pogger Apr 03 '21 at 04:37
  • We have to pass them ourselves, `numeric_features = ['Salary']` etc. We can also use `from sklearn.compose import make_column_selector` and select only numeric columns; see https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py – Prayson W. Daniel Apr 03 '21 at 04:48
  • Hi @PraysonW.Daniel, I found your answer to this question very useful for my current problem. May I ask you to have a look, if you do not mind? https://stackoverflow.com/questions/67493509/pre-processing-resampling-and-pipelines (maybe you can help me figure out what I am doing wrong). Thanks a lot – Math May 12 '21 at 11:47
13

I think the poster is not trying to transform the Age and Salary columns. Per the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) transforms only the columns specified in the transformers (i.e., [0] in your example). Set remainder="passthrough" to keep the rest of the columns. In other words:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer((OneHotEncoder(), [0]), remainder="passthrough")
x = preprocessor.fit_transform(x)
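One caveat worth adding (an editorial note, not from the original answer): depending on the data, ColumnTransformer may return a SciPy sparse matrix, so the asker's trailing .toarray() can still be needed. A guarded sketch:

from scipy import sparse

if sparse.issparse(x):  # densify only if the result came back sparse
    x = x.toarray()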
passerby
  • How do you tackle this warning? FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning) – Fawwaz Yusran Jun 18 '19 at 02:35
  • @FawwazYusran just comment out the lines containing the LabelEncoder and directly use passerby's suggestion. – Sandipan Majhi Aug 31 '19 at 20:34
5

The simplest method is to use pandas get_dummies on your CSV DataFrame:

import pandas as pd

dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset, columns=['Country'])

Finished! Your dataset will now contain one indicator column per country.
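The result should look roughly like this (a sketch; the original answer linked an image, and newer pandas versions may show the dummies as True/False instead of 1/0):

print(dataset.head(3))
#    Age  Salary  Country_France  Country_Germany  Country_Spain
# 0   44   72000               1                0              0
# 1   27   48000               0                0              1
# 2   30   54000               0                1              0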

Shiva_Adasule
  • I have been working with sklearn for over a year and I did not know this. Thanks a lot! – Dominik Novotný Feb 04 '20 at 06:21
  • It might be easier to use pandas for preprocessing data; however, using sklearn for the same has advantages, as the preprocessing steps can be used in pipelines and later cross-validated along with model performance. – Dr Nisha Arora Sep 16 '20 at 02:49
  • @Shiva_Adasule I tried this but got the same error as before the conversion: ValueError: could not convert string to float: 'Honda' – Web Development Labs Oct 09 '20 at 21:41
2

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]),
    remainder="passthrough")
X = preprocess.fit_transform(X)

I fixed the same issue using the above code.

Arvind Chavhan (answer edited by Nicolás Ozimica)
1

@FawwazYusran, to tackle this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

Remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Since you are using OneHotEncoder directly, you don't need LabelEncoder.

Yung
1

You can use OneHotEncoder directly; you don't need LabelEncoder:

# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(
    transformers=[
        ("OneHotEncoder",  # just a name
         OneHotEncoder(),  # the transformer class
         [0]               # the column(s) to apply it to (here, the Country column)
         )
    ],
    remainder='passthrough'
)
X = transformer.fit_transform(X.tolist())
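A side note (not from the original answer): the .tolist() conversion should not be required, since ColumnTransformer also accepts a NumPy array or DataFrame directly:

X = transformer.fit_transform(X)  # passing the array directly also works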
Suresh Mangs
0

Since you are transforming only the Country column (i.e., [0] in your example), use remainder="passthrough" so the remaining columns are kept as they are.

Try:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
preprocess = ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                               [0])], remainder="passthrough")
x = np.array(preprocess.fit_transform(x), dtype=int)  # np.int is deprecated; use int
Lekha (answer edited by Reegan Miranda)
0

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0])], remainder='passthrough')
x = onehotencorder.fit_transform(x).toarray()

The great advantage of OneHotEncoder is that it can convert several columns at once; see the example passing several columns:

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1,3,5,6,7,8,9,13])],remainder='passthrough')
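Keep in mind (an editorial note, not from the original answer) that the one-hot blocks for the listed columns come first in the output, followed by the passthrough columns, so the original column positions shift. In newer scikit-learn versions (1.0+), the resulting column order can be inspected:

onehotencorder.fit(x)
print(onehotencorder.get_feature_names_out())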

If it's a single column, you can do it the traditional way:

from sklearn.preprocessing import LabelEncoder
labelencoder_predictors = LabelEncoder()
x[:,0] = labelencoder_predictors.fit_transform(x[:,0])

Another suggestion: do not use variable names like x, y, z. Name variables after what they represent, for example predictors, classes, countries, etc.

Dorathoto
0

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])
# Country is column 0, so pass [0] to the transformer
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
# onehotencoder = OneHotEncoder(categorical_features=[0])  # old, deprecated API
X = ct.fit_transform(X).toarray()