One-hot-encoding multiple columns in sklearn and naming columns

Question

I have the following code to one-hot-encode two columns.

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

What I'm wondering is how I can do this in 4 lines while getting properly named columns in the output. That is, I can create a properly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting dataframe's columns, it tells me that there is a mismatch between the shape of the indices:

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city   phone
0   CityA  iPhone
1   CityB  Android
2   CityB  iPhone
3   CityA  iPhone
4   CityC  Android

score 12 · Accepted Answer · answered Mar 19 '19 at 01:03

You are almost there... Like you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

And then you just need to do the following:

feature_labels = np.array(feature_labels).ravel()

Which enables you to name your columns like you wanted:

features = pd.DataFrame(feature_arr, columns=feature_labels)
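Note that ravel() on the list of category arrays relies on every encoded column having the same number of categories. When the counts differ, which is what the shape-mismatch error in the comments below suggests, one option is to flatten the list with np.concatenate instead. The following is a minimal sketch, assuming only the five sample rows from the question (where phone happens to have two distinct values and city three); it is a variation on the answer's approach, not the answerer's own code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows from the question: city has 3 categories, phone has 2 here
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[["phone", "city"]]).toarray()

# categories_ is a list holding one array per input column;
# np.concatenate joins them end to end even when their lengths differ
feature_labels = np.concatenate(ohe.categories_)

features = pd.DataFrame(feature_arr, columns=feature_labels)
print(features.columns.tolist())
```

The labels come out in the same order as the encoded columns: all phone categories first, then all city categories.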

MaximeKan

Thanks! I should have found that ravel() function. It's super helpful. – Python Developer Mar 19 '19 at 01:28

@MaximeKan I'm having problems creating the dataframe with more than one feature. It returns the error Shape of passed values is (10692, 7), indices imply (10692, 2), so I'm having to write the feature labels manually. How do I solve this? – Vitor Gonçalves Jun 25 '20 at 19:11

@VitorGonçalves This happens because the dataset returned by `fit_transform` has 7 columns after the transform, so Pandas expects 7 corresponding labels in the `feature_labels` array, but it only has 2 elements. To fix this error, replace `feature_labels = ohe.categories_` with `feature_labels = ohe.get_feature_names()` – Pavindu Jun 16 '21 at 06:36

panktijk · Answer 2 · 2019-03-18T23:11:48.973

1

Why don't you take a look at pd.get_dummies? Here's how you can encode:

df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
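To show what the resulting column names look like, here is a small sketch that runs the same three lines on the sample rows from the question. The DataFrame construction and the variable name encoded are added for illustration only.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

df["city"] = df["city"].astype("category")
df["phone"] = df["phone"].astype("category")

# Each categorical column is replaced by one indicator column per category,
# named <column>_<category>
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

The indicator columns are named after the source column and the category, so no separate labelling step is needed.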

edited Mar 18 '19 at 23:11

answered Mar 18 '19 at 23:05

panktijk


Thanks panktijk. I ended up doing that, but I was wondering if it could be done in sklearn. – Python Developer Mar 19 '19 at 00:49
I am told that it is usually preferable to use sklearn's OneHotEncoder because it integrates better with an ML workflow (e.g. you can use sklearn's make_pipeline with OneHotEncoder) – Constantly confused Mar 22 '22 at 04:56
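As a rough illustration of the pipeline point in the comment above, here is a minimal sketch that chains OneHotEncoder with a classifier via make_pipeline. The target values y, the choice of LogisticRegression, and the unseen city "CityD" are all made up for the example; handle_unknown="ignore" is one possible setting, not something the thread prescribes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Sample rows from the question
X = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})
y = [1, 0, 0, 1, 0]  # hypothetical target, invented for this sketch

# The encoder is fitted together with the model, so the same category
# mapping is reused automatically at prediction time
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories encode as all zeros
    LogisticRegression(),
)
model.fit(X, y)

# "CityD" never appeared during fit; the encoder ignores it instead of raising
new_rows = pd.DataFrame({"city": ["CityD"], "phone": ["iPhone"]})
pred = model.predict(new_rows)
```

With pd.get_dummies, by contrast, the set of output columns depends on whichever rows are passed in, so training and prediction data would have to be aligned by hand.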

score 1 · Answer 3 · answered Oct 19 '22 at 17:14

This solution gives the same column names as pd.get_dummies(), which is useful IMO.

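labels = ['Sex', 'Embarked', 'Pclass']

categorical_data = data[labels]

ohe = OneHotEncoder(categories='auto')

# the parentheses are required to split a method chain across lines
feature_arr = (
    ohe
    .fit_transform(categorical_data)
    .toarray()
)

# builds names of the form <column>_<category>;
# get_feature_names was removed in scikit-learn 1.2,
# where get_feature_names_out(labels) takes its place
ohe_labels = ohe.get_feature_names(labels)

features = pd.DataFrame(
    feature_arr,
    columns=ohe_labels)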

some_newbie

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. – Joe Ferndz Aug 04 '23 at 07:47

score 0 · Answer 4 · answered May 01 '21 at 06:58

cat_features = [
    "gender", "cholesterol", "gluc", "smoke", "alco"
]

data = pd.get_dummies(data, columns = cat_features)

naimur978
