How does sklearn know which columns are One-Hot encoded?

Question

I have a data set in which some columns are of type object and others are of type int or float. I understand that I need to convert the object columns to dummy variables, but some of the int and float columns hold binary data (already 0 and 1). Will sklearn interpret these columns as categorical or not? I do not want them to be treated as continuous variables.

"sklearn" has functions and classes that process data. "Sklearn" does not interpret your dataframe as such — Mad Physicist, Oct 25 '19 at 00:40

score 0 · Answer 1 · answered Oct 25 '19 at 07:44

OneHotEncoder does not detect which columns are categorical. Hence, every column that is fed to OneHotEncoder is converted into dummy variables.

You can refer to the examples here.

If you already have binary variables, it doesn't make sense to create two dummy variables for each of them.
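To see why this matters, here is a minimal sketch (the toy data frame is an assumption made for illustration) in which the whole frame is passed to OneHotEncoder. The already-binary column is expanded into two dummy columns, just like the object column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed for illustration): one object column, one 0/1 column.
X = pd.DataFrame(
    [["Male", 0], ["Female", 1], ["Female", 0]],
    columns=["gender", "groups"],
)

# Fit the encoder on every column, without telling it which are categorical.
enc = OneHotEncoder()
encoded = enc.fit_transform(X).toarray()  # fit_transform returns a sparse matrix by default

# Each input column contributes one output column per distinct value.
print(enc.categories_)
print(encoded.shape)
```

Both input columns have two distinct values, so the encoded array has four columns rather than three.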

You can use make_column_transformer to specify which columns need one-hot encoding.

Example:

>>> import pandas as pd
>>> X = pd.DataFrame([['Male', 0], ['Female', 1], ['Female', 0]], columns=['gender', 'groups'])
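>>> from sklearn.compose import make_column_transformer
>>> from sklearn.preprocessing import OneHotEncoder
>>> # One-hot encode only column 0 ('gender'); the other columns are dropped unless remainder='passthrough' is passed
>>> ct = make_column_transformer((OneHotEncoder(), [0]))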
>>> ct.fit_transform(X)
array([[0., 1.],
       [1., 0.],
       [1., 0.]])
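
If the binary column should be kept as-is instead of dropped, one option is remainder='passthrough'. The following is a sketch under the assumption that every non-numeric column is categorical; the variable name categorical_cols is made up for illustration:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the example above.
X = pd.DataFrame(
    [["Male", 0], ["Female", 1], ["Female", 0]],
    columns=["gender", "groups"],
)

# Assumption: every non-numeric column should be one-hot encoded.
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Encode only those columns; pass all remaining columns through unchanged.
ct = make_column_transformer(
    (OneHotEncoder(), categorical_cols),
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
print(Xt)
```

The result has three columns: the two dummy columns for gender, followed by the original groups column.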
